The goal of this project is to investigate the prevalence and risk factors of lung transplantation in the United States using publicly available data from the United Network for Organ Sharing (UNOS). We will also investigate how the coefficients of the lung allocation score (LAS; a prediction model used to prioritize patients for lung transplant) have changed over time. Finally, we will assess how well the current LAS model predicts post-transplant survival, and compare its performance to that of a logistic regression model which allows the coefficients of important predictors to vary with time.
Given the interdisciplinary nature of this project, I sought advice from representatives from the fields of epidemiology (Dr. Stephen Kimmel, MD, MSCE), statistics (Dr. Alisa Stephens-Shields, PhD), sociology (Dr. Julia Szymczak, PhD), and lung transplant surgery (Dr. Edward Cantu III, MD, MSCE). As explained further in the introduction, these individuals helped me focus my project on examining how the coefficients of the LAS model might change over time, and whether allowing for such changes improves this model’s performance. All information contained in this report is available on my final project GitHub repository.
Please note that while the UNOS data is publicly available, researchers must submit a Data Request prior to using it. I have obtained access to this data through my advisor, Dr. Stephen Kimmel. For the purposes of this project, however, I have provided a simulated data set on my GitHub repository. This data set was simulated based on the observed UNOS data. The Rmd file that appears on my final project repository will run using this simulated data set.
In May 2005, the Organ Procurement and Transplantation Network (OPTN) modified their lung transplantation policy from one in which potential recipients were prioritized based on the amount of time spent on the waiting list to one in which donor lungs were allocated to recipients based on a lung allocation score (LAS) [1,2]. The LAS essentially compares patients’ transplant benefit to their waiting list urgency, assigning individuals with higher values greater priority for transplant. Since the adoption of the LAS, studies have shown that the mortality rate for patients on the lung transplant waiting list has decreased [2]. However, resource use and geographic and gender disparities in lung allocation have increased [3-6]. Additionally, while the statistical models used to construct the LAS control for patients’ demographics, diagnoses, and laboratory values, the model itself is static in nature. That is, a derivation cohort is assembled, model coefficients are estimated based on this cohort, and then those coefficients are applied to individuals outside the cohort. To date, the LAS coefficients have been updated using three different cohorts at three different times: 2005, 2010, and 2015 [7]. Yet no one has investigated how LAS coefficients have changed over time, or whether such changes impact the predictive accuracy of the LAS.
In this project, we aim to explore the prevalence and risk factors of lung transplantation in the United States using publicly available data from the United Network for Organ Sharing (UNOS) collected between 2007 and 2015. We will also investigate how the coefficients of the lung allocation score (LAS; a prediction model used to prioritize patients for lung transplant) have changed over time. Finally, we will assess how well the current LAS model predicts post-transplant survival, and compare its performance to that of a logistic regression model which allows the coefficients of important predictors to vary with time.
This problem is relevant to many different stakeholders, including epidemiologists, statisticians, clinicians, and sociologists. Epidemiologists are interested in characterizing how both the pre- and post-transplant mortality rate of individuals with chronic lung diseases changes over time, as well as identifying whether certain risk factors are associated with mortality. Statisticians are interested in developing tools that allow prediction models (such as the LAS) to be updated more easily and frequently, as “dynamic” modeling strategies might improve the performance of prediction models in real-world settings. Clinicians are interested in predicting pre- and post-transplant survival accurately so that donor lungs are allocated to the appropriate patients in the appropriate order. Finally, sociologists are interested in understanding the role such prediction models play in clinical decision making.
Given the interdisciplinary nature of this project, I sought advice from representatives from each of these fields: Dr. Stephen Kimmel, MD, MSCE, a cardiovascular epidemiologist with experience in the development and validation of clinical prediction models; Dr. Alisa Stephens-Shields, PhD, a biostatistician with expertise in observational studies, longitudinal data, and causal inference methods; Dr. Julia Szymczak, PhD, a medical sociologist with expertise in using qualitative methods to understand medical culture and practice; and Dr. Edward Cantu III, MD, MSCE, a lung transplant surgeon at the Hospital of the University of Pennsylvania. Together, these individuals have suggested that I: 1) compare the observed mortality outcomes for patients who received lung transplants to the mortality outcomes that were predicted by the existing lung allocation score (LAS) algorithm; 2) compare the observed mortality outcomes to those predicted by a logistic regression model which allows for coefficients to change over time; 3) think beyond traditional measures of predictive accuracy, such as calibration and discrimination, to consider the target parameter that the LAS algorithm purports to estimate versus the target parameter that the LAS algorithm actually estimates; 4) think about how the current LAS model is used in clinical practice, and whether modifications to this model would influence how it is used; and 5) consider looking into disparities in transplantation among different patient populations (e.g., patients with different ages, race/ethnicity, or diagnoses).
The United Network for Organ Sharing (UNOS) is a private organization charged by the Organ Procurement and Transplantation Network (OPTN) to coordinate the allocation of donor organs in the United States. Toward this end, UNOS has developed a database to keep track of all potential transplant recipients. By law, every patient who would like to receive a transplant is required to be registered on the UNOS waiting list. UNOS then follows these patients until they either 1) receive transplant; 2) die; or 3) are removed from the waiting list (i.e., censored) due to illness or loss to follow up. The database contains information on patients’ lab values, demographic characteristics, and any predictive scores used to aid the allocation process (e.g., the Lung Allocation Score (LAS) for allocating lungs; the Model for End Stage Liver Disease (MELD) score for allocating livers). This data is publicly available, although researchers must submit a Data Request prior to using it. (As an aside, I have obtained access to this data through my advisor, Dr. Stephen Kimmel. For the purposes of this project, however, I have provided a simulated data set on my GitHub repository.)
In this study, we are most interested in examining how well the current Lung Allocation Score (LAS) predicts post-transplant survival among adults who actually receive transplant. Thus, our study population consists of all patients 18 years or older who are listed for and received a single or bi-lateral lung transplantation in the United States between January 1, 2007 and December 31, 2015. This date range was chosen because it ensures that no patient experiences any person-time prior to the implementation of the LAS (i.e., prior to May 2005); it also guarantees that individuals experience enough follow-up time to power our analyses. Furthermore, while the analyses presented here are restricted to the post-transplant population, future research will explore the effect of the LAS among the full population of all individuals regsitered on the UNOS waiting list.
The post-transplant portion of the LAS aims to estimate the 1 year post-transplant survival probability for patients who receive transplant. The predictors of this model include: age at transplant (\(years\)), cardiac index (\(L/min/m^2\)), use of mechanical ventilation (\(yes/no\)), serum creatinine (\(mg/dL\)), creatinine increase of at least 150%, diagnosis group (i.e., obstructive diseases, pulmonary hypertension, cystic fibrosis, and pulmonary fibrosis), functional status (i.e., requires assistance to perform daily tasks, \(yes/no\)), oxygen requirement (\(L/min\)), and six-minute walk distance (\(feet\)). The outcome of this model is a binary indicator of mortality (i.e., \(1=died, 0=alive\)). Some of these variables are already present in the UNOS data, while others must be calculated. Thus, we begin by loading the (simulated) post-transplant UNOS data and center, truncate, and/or recode variables as appropriate. More specifically, we:
Restrict our population to only include data falling between 2007 and 2015 (note that the post-transplant data set has already been restricted to adults awaiting lung transplant).
Center the age and six-minute walk distance variables.
Trim cardiac index, creatinine, and six-minute walk values above the 99th percentile to mitigate the impact of extreme values.
Compute the creatinine increase variable.
Dichotomize the mechanical ventilation and functional status variables.
# Attach relevant packages on which this code relies
library(stats)
library(ggplot2)
library(gcookbook)
library(survival)
library(gdata)
library(gtools)
library(gmodels)
library(gplots)
library(cmprsk)
library(ipw)
library(readxl)
library(plyr)
library(dplyr)
library(zoo)
library(gsubfn)
library(tidyr)
library(pastecs)Please note that the cleaning code below uses last-observation-carried-forward (LOCF) to fill in missing values of some variables in the raw UNOS data (within individual patients). It requires the use of additional UNOS data (i.e., a “history file” which includes multiple records for each subject), and takes approximately 20 minutes to run. It is not necessary to run this section of code when using the simulated data set, because the simulated data set contains no missing data. In other words, when using simulated data, you should set the options for this code chunk to “eval=FALSE”.
# Load the raw data sets
LAS_hist <- read_dta("K:/Data/UNOS-Lung/THORACIC_LAS_HISTORY_DATA.DTA")
LAS_audit <- read_dta("K:/Data/UNOS-Lung/THORACIC_LAS_AUDIT_DATA.DTA")
thoracic <- read_dta("K:/Data/UNOS-Lung/THORACIC_DATA.DTA")
# Load the temp data set from Michael's Stata code
UNOSclean <- read_dta("K:/Data/_temp.dta")
# Format the variables
UNOSclean$pstatus <- as.numeric(UNOSclean$pstatus)
UNOSclean$age <- as.numeric(UNOSclean$age)
UNOSclean$white <- as.numeric(UNOSclean$ethcat %in% 1)
UNOSclean$race <- factor(UNOSclean$ethcat)
levels(UNOSclean$race) <- c("White","Black","Hispanic","Others","Others","Others","Others","Others")
# Patient's variables
UNOSclean$gender <- factor(UNOSclean$gender, labels=c("Female","Male"))
UNOSclean$reg <- factor(UNOSclean$region)
UNOSclean$insur <- factor(UNOSclean$pri_payment_trr)
levels(UNOSclean$insur) <- c("Private","Medicaid","Medicare","Medicare","Medicaid","VA/Gov","VA/Gov","Other","Other","Other","VA/Gov")
UNOSclean$insur2 <- as.character(UNOSclean$insur)
UNOSclean$insur2[is.na(UNOSclean$insur)] <- "Unknown"
UNOSclean$insur2 <- factor(UNOSclean$insur2)
UNOSclean$insur2 <- reorder(UNOSclean$insur2, new.order=c(4,2,1,6,3,5))
UNOSclean$aboc <- factor(UNOSclean$abo)
levels(UNOSclean$aboc) <- c("A","A","A","A","AB","AB","B","O")
UNOSclean <- UNOSclean[UNOSclean$txed==1,]
#UNOSclean$tx <- factor(UNOSclean$tx_type, labels=c("BLT","SLT"))
UNOSclean$las <- UNOSclean$end_match_las
UNOSclean$bmic <- factor(cut(UNOSclean$bmi_calc, c(12.5,18.5,25,30,35,45), right=F), labels=c("<18.5","18.5-25","25-30","30-35","35+"))
UNOSclean$dongender <- factor(UNOSclean$gender_don, labels=c("Female","Male"))
UNOSclean$donrace <- factor(UNOSclean$ethcat_don)
levels(UNOSclean$donrace) <- c("White","Non-white","Non-white","Non-white","Non-white","Non-white","Non-white")
UNOSclean$donpulmi <- factor(UNOSclean$pulm_inf_don, labels=c("No","Yes"))
UNOSclean$donsmoke2 <- factor(UNOSclean$hist_cig_don)
levels(UNOSclean$donsmoke2) <- c("Unkown","No","Unkown","Yes")
UNOSclean$donsmoke2 <- reorder(UNOSclean$donsmoke2, new.order=c(2,3,1))
UNOSclean$donsmoke <- UNOSclean$donsmoke2
levels(UNOSclean$donsmoke) <- c("No","Yes",NA)
UNOSclean$dondrug2 <- factor(UNOSclean$hist_oth_drug_don)
levels(UNOSclean$dondrug2) <- c("Unkown","No","Unkown","Yes")
UNOSclean$dondrug2 <- reorder(UNOSclean$dondrug2, new.order=c(2,3,1))
UNOSclean$dondrug <- UNOSclean$dondrug2
levels(UNOSclean$dondrug) <- c("No","Yes",NA)
UNOSclean$pO2 <- factor(cut(UNOSclean$po2, c(0,300,770), right=F), labels=c("Low pO2","Normal pO2"))
UNOSclean$pO2 <- reorder(UNOSclean$pO2, new.order=c(2,1))
UNOSclean$caudec <- factor(UNOSclean$cod_cad_don, labels=c("Anoxia","CVA/Stroke", "Head trauma","CNS tumor","Other"))
UNOSclean$dx <- factor(UNOSclean$diag_group, labels=c("Obstructive dx", "Pulm Htx", "CF", "Pulm fibrosis"))
UNOSclean$ytx <- as.numeric(as.character(UNOSclean$yr_trxp))
UNOSclean$calc_vent_use <- as.factor(UNOSclean$calc_vent_use)
# Remove 2006 data due to incompleteness,
# remove 2016 data due to the fact that nearly 50% of patients have follow-up times < 1 year,
# center the resulting year around 2007
UNOSclean2 <- subset(UNOSclean, !(UNOSclean$ytx %in% c(2006, 2016)))
UNOSclean2$ytx_cent <- UNOSclean2$ytx - 2007
# Use Last Observation Carried Forward on the LAS history file to fill in missing values (within individuals)
# See https://stackoverflow.com/questions/23818493/carry-last-observation-forward-by-id-in-r
LAS_hist2 <- transform(LAS_hist, ci = fn$ave(ci, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, calc_o2 = fn$ave(calc_o2, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, calc_vent_use = fn$ave(calc_vent_use, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, creatinine = fn$ave(creatinine, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, serum = fn$ave(serum, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, func_stat = fn$ave(func_stat, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, calc_func_stat = fn$ave(calc_func_stat, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
LAS_hist2 <- transform(LAS_hist2, calc_six_min_walk = fn$ave(calc_six_min_walk, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
# Use Last Observation Carried Forward on the Thoracic file to fill in missing values (within individuals)
thoracic2 <- transform(thoracic, age = fn$ave(age, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
thoracic2 <- transform(thoracic2, init_creat = fn$ave(init_creat, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
thoracic2 <- transform(thoracic2, end_creat = fn$ave(end_creat, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
#thoracic2 <- transform(thoracic2, most_rcnt_creat=fn$ave(most_rcnt_creat,wl_id_code, FUN=~na.locf(x, na.rm=FALSE)))
thoracic2 <- transform(thoracic2, func_stat_trr = fn$ave(func_stat_trr, wl_id_code, FUN = ~ na.locf(x, na.rm=FALSE)))
# Create the "complete creatinine" variable, which equals "serum" in 2007-2014 and "creatinine" in 2015
LAS_hist2$compcreat <- rep(NA, nrow(LAS_hist2))
for (i in 1:nrow(LAS_hist2)){
if (LAS_hist2$chg_date[i]<"2015-01-01"){
LAS_hist2$compcreat[i] <- LAS_hist2$serum[i]
} else {
LAS_hist2$compcreat[i] <- LAS_hist2$creatinine[i]
}
}
# Lag the "compcreat", "serum" and "chg_date" values by 1 within individuals
# See: https://stackoverflow.com/questions/26291988/how-to-create-a-lag-variable-within-each-group/26292059
LAS_hist3 <-
LAS_hist2 %>%
group_by(wl_id_code) %>%
mutate(lag.compcreat = dplyr::lag(compcreat, n = 1, default = NA),
lag.serum = dplyr::lag(serum, n = 1, default = NA),
lag.chg_date = dplyr::lag(chg_date, n = 1, default = NA))
# Compute creatinine increase (compare values between visits within 6 months of each other), and flag if >= 150%
LAS_hist3$creatincp <- rep(NA, nrow(LAS_hist3)) # initialize creatinine increase indicator
for (i in 1:nrow(LAS_hist3)){
if (!is.na(LAS_hist3$chg_date[i]) && !is.na(LAS_hist3$lag.chg_date[i])){
if(as.numeric(abs(LAS_hist3$chg_date[i]-LAS_hist3$lag.chg_date[i]))< 182.5){ # Ensure visits are within 6 months
LAS_hist3$creatincp[i]<-(abs(LAS_hist3$compcreat[i]-LAS_hist3$lag.compcreat[i])/LAS_hist3$lag.compcreat[i])*100
} else {
LAS_hist3$creatincp[i] <- 0
}
} else {
LAS_hist3$creatincp[i] <- NA
}
}
# Initialize creatinine increase flag
LAS_hist3$creatincp_flag <- rep(NA, nrow(LAS_hist3))
for (i in 1:nrow(LAS_hist3)){ # Loop over each row
if (!is.na(LAS_hist3$creatincp[i]) && LAS_hist3$creatincp[i] >= 150 &&
max(LAS_hist3$compcreat[i], LAS_hist3$lag.compcreat[i]) >= 1){
LAS_hist3$creatincp_flag[i] <- 1
} else if (!is.na(LAS_hist3$creatincp[i]) &&
((LAS_hist3$creatincp[i] < 150) | (max(LAS_hist3$compcreat[i], LAS_hist3$lag.compcreat[i]) < 1))){
LAS_hist3$creatincp_flag[i] <- 0
}
}
# Convert creatinine increase flag to a factor variable
LAS_hist3$creatincp_flag <- as.factor(LAS_hist3$creatincp_flag)
# Merge the filled-in data sets together, and then merge these with the post-transplant data set provided by Michael Harhay
raw_locf <- merge(x=thoracic2, y=LAS_hist3, by=c("wl_id_code"))
temp_locf <- merge(x=UNOSclean2, y=LAS_hist3, by=c("wl_id_code", "chg_date"))
temp_locf2 <- merge(x=UNOSclean2, y=raw_locf, by=c("wl_id_code", "chg_date"))
UNOSclean3 <- temp_locf2
UNOSclean3$calc_vent_use.y <- as.factor(UNOSclean3$calc_vent_use.y)
# Rename filled-in variables back to their original names
UNOSclean4 <- dplyr::rename(UNOSclean3,
age = age.y,
ci = ci.y,
calc_o2 = calc_o2.y,
calc_vent_use = calc_vent_use.y,
creatinine = creatinine.y,
init_creat = init_creat.y,
end_creat = end_creat.y,
func_stat_trr = func_stat_trr.y,
calc_six_min_walk = calc_six_min_walk.y,
tx_date = tx_date.y)
# Create an "ever increase" indicator which equals 1 if a patient ever experienced an increase in creatinine >= 150%,
# and 0 if a patient never experienced an increase in creatinine >= 150%
UNOSclean4$ever_creatinc <- rep(NA, nrow(UNOSclean4)) # initialize "ever increase" indicator
creatincids <- LAS_hist3$wl_id_code[which(LAS_hist3$creatincp_flag==1)] # Get ID # of people with flag==1
nocreatincids <- LAS_hist3$wl_id_code[which(LAS_hist3$creatincp_flag==0)] # Get ID # of people with flag==0
for (i in 1:nrow(UNOSclean4)){ # Loop over each row in the clean data set
if (UNOSclean4$wl_id_code[i] %in% creatincids){ # If ID # matches and flag==1
UNOSclean4$ever_creatinc[i] <- 1
} else if (UNOSclean4$wl_id_code[i] %in% nocreatincids){ # If ID # matches and flag==0
UNOSclean4$ever_creatinc[i] <- 0
} else { # Otherwise, creatinine increase is missing
UNOSclean4$ever_creatinc[i] <- NA
}
}
# Convert to a factor variable
UNOSclean4$ever_creatinc <- as.factor(UNOSclean4$ever_creatinc)
# ALTERNATIVE APPROACH: Compute change in creatinine as difference between END_CREAT and LAG.COMPCREAT
UNOSclean4$creatinc2 <- rep(NA, nrow(UNOSclean4)) # initialize creatinine increase indicator
for (i in 1:nrow(UNOSclean4)){
if (!is.na(UNOSclean4$tx_date[i]) && !is.na(UNOSclean4$lag.chg_date[i])){
if(as.numeric(abs(UNOSclean4$tx_date[i]-UNOSclean4$lag.chg_date[i]))<182.5){ #Ensure visits are within 6 months
UNOSclean4$creatinc2[i]<-(abs(UNOSclean4$end_creat[i]-UNOSclean4$lag.compcreat[i])/UNOSclean4$lag.compcreat[i])*100
} else {
UNOSclean4$creatinc2[i] <- 0
}
} else {
UNOSclean4$creatinc2[i] <- NA
}
}
# Initialize creatinine increase flag
UNOSclean4$creatinc2_flag <- rep(NA, nrow(UNOSclean4))
for (i in 1:nrow(UNOSclean4)){ # Loop over each row
if (!is.na(UNOSclean4$creatinc2[i]) && (UNOSclean4$creatinc2[i] >= 150) &&
(max(UNOSclean4$end_creat[i],UNOSclean4$lag.compcreat[i]) >= 1)){
UNOSclean4$creatinc2_flag[i] <- 1
} else if (!is.na(UNOSclean4$creatinc2[i]) &&
((UNOSclean4$creatinc2[i] < 150) | (max(UNOSclean4$end_creat[i],UNOSclean4$lag.compcreat[i]) < 1))){
UNOSclean4$creatinc2_flag[i] <- 0
}
}
# Convert creatinine increase flag to a factor variable
UNOSclean4$creatinc2_flag <- as.factor(UNOSclean4$creatinc2_flag)
# ALTERNATIVE APPROACH 2: Compute change in creatinine as difference between END_CREAT and LAG.SERUM
UNOSclean4$creatinc3 <- rep(NA, nrow(UNOSclean4)) # initialize creatinine increase indicator
for (i in 1:nrow(UNOSclean4)){
if (!is.na(UNOSclean4$tx_date[i]) && !is.na(UNOSclean4$lag.chg_date[i])){
if(as.numeric(abs(UNOSclean4$tx_date[i]-UNOSclean4$lag.chg_date[i]))<182.5){ #Ensure visits are within 6 months
UNOSclean4$creatinc3[i]<-(abs(UNOSclean4$end_creat[i]-UNOSclean4$lag.serum[i])/UNOSclean4$lag.serum[i])*100
} else {
UNOSclean4$creatinc3[i] <- 0
}
} else {
UNOSclean4$creatinc3[i] <- NA
}
}
# Initialize creatinine increase flag
UNOSclean4$creatinc3_flag <- rep(NA, nrow(UNOSclean4))
for (i in 1:nrow(UNOSclean4)){ # Loop over each row
if (!is.na(UNOSclean4$creatinc3[i]) && (UNOSclean4$creatinc3[i] >= 150) &&
(max(UNOSclean4$end_creat[i],UNOSclean4$lag.serum[i]) >= 1)){
UNOSclean4$creatinc3_flag[i] <- 1
} else if (!is.na(UNOSclean4$creatinc3[i]) &&
((UNOSclean4$creatinc3[i] < 150) | (max(UNOSclean4$end_creat[i],UNOSclean4$lag.serum[i]) < 1))){
UNOSclean4$creatinc3_flag[i] <- 0
}
}
# Convert creatinine increase flag to a factor variable
UNOSclean4$creatinc3_flag <- as.factor(UNOSclean4$creatinc3_flag)
# Compute percentage of missingness for serum in LAS history data (both raw and LOCF) pre-2015
# Compute percentage of missingness for creatinine in LAS history data (both raw and LOCF) post-2015
# LAS History File - RAW
misspct_LAShistraw_pre2015 <- (as.numeric(summary(LAS_hist$serum[which(LAS_hist$chg_date<"2015-01-01")])[7])/
length(LAS_hist$serum[which(LAS_hist$chg_date<"2015-01-01")]))*100
misspct_LAShistraw_post2015 <- (as.numeric(summary(LAS_hist$creatinine[which(LAS_hist$chg_date>="2015-01-01")])[7])/
length(LAS_hist$creatinine[which(LAS_hist$chg_date>="2015-01-01")]))*100
# LAS History File - LOCF
misspct_LAShistLOCF_pre2015 <-(as.numeric(summary(LAS_hist3$serum[which(LAS_hist3$chg_date<"2015-01-01")])[7])/
length(LAS_hist3$serum[which(LAS_hist3$chg_date<"2015-01-01")]))*100
misspct_LAShistLOCF_post2015<-(as.numeric(summary(LAS_hist3$creatinine[which(LAS_hist3$chg_date>="2015-01-01")])[7])/
length(LAS_hist3$creatinine[which(LAS_hist3$chg_date>="2015-01-01")]))*100
# Create a table of the missingness percentages for creatinine
miss_creat_tab <- as.data.frame(rbind(c(misspct_LAShistraw_pre2015, misspct_LAShistraw_post2015),
c(misspct_LAShistLOCF_pre2015, misspct_LAShistLOCF_post2015)))
colnames(miss_creat_tab) <- c("% Missing in 2007-2014", "% Missing in 2015")
rownames(miss_creat_tab) <- c("LAS History: Serum Creatinine (Raw)", "LAS History: Serum Creatinine (LOCF)")
# Recode missing creatinine increase indicator values with 0 for modeling purposes
for (i in 1:nrow(UNOSclean4)){
if (is.na(UNOSclean4$creatinc2_flag[i])){
UNOSclean4$creatinc2_flag[i] <- 0 # If creatinine increase flag is NA, replace with 0
}
}
# Trim creatinine, cardiac index, and six-minute walk values above the 99th percentile
UNOSclean4 <- UNOSclean4[UNOSclean4$end_creat < quantile(UNOSclean2$end_creat, 0.99), ]
UNOSclean4 <- subset(UNOSclean4, (is.na(UNOSclean4$ci) | (UNOSclean4$ci < quantile(UNOSclean4$ci, 0.99, na.rm=TRUE))))
UNOSclean4 <- subset(UNOSclean4, (is.na(UNOSclean4$calc_six_min_walk) | (UNOSclean4$calc_six_min_walk < quantile(UNOSclean4$calc_six_min_walk, 0.99, na.rm=TRUE))))
# Center the age variable and six minute walk variable
UNOSclean4$age_cent <- UNOSclean4$age - 45.9972602
UNOSclean4$walk_cent <- 1200-UNOSclean4$calc_six_min_walk
# Recode functional status
UNOSclean4$func_stat_trr <- as.numeric(UNOSclean4$func_stat_trr)
# Initialize binary indicator (1 if >=2050, 0 otherwise)
UNOSclean4$func_stat_trr_bin <- rep(NA, nrow(UNOSclean4))
for (i in 1:nrow(UNOSclean4)){ # Loop over each patient
if (!(is.na(UNOSclean4$func_stat_trr[i])) && UNOSclean4$func_stat_trr[i] == 998){
UNOSclean4$func_stat_trr[i] <- NA # Replace "998" with missing
}
if(!(is.na(UNOSclean4$func_stat_trr[i])) && UNOSclean4$func_stat_trr[i] >= 2050){
UNOSclean4$func_stat_trr_bin[i] <- 1
} else if (!(is.na(UNOSclean4$func_stat_trr[i])) && UNOSclean4$func_stat_trr[i] < 2050){
UNOSclean4$func_stat_trr_bin[i] <- 0
}
}
# Convert functional status into a factor variable
UNOSclean4$func_stat_trr <- as.factor(UNOSclean4$func_stat_trr)
UNOSclean4$func_stat_trr_bin <- as.factor(UNOSclean4$func_stat_trr_bin)
# Recode Mechanical Ventilation
# (indicator: 1=continuous mechanical ventilation, 0=everything else)
UNOSclean4$calc_vent_use_bin <- rep(NA, nrow(UNOSclean4)) # Initialize indicator
for (i in 1:nrow(UNOSclean4)){ # Loop over each patient
if (!is.na(UNOSclean4$calc_vent_use[i]) && UNOSclean4$calc_vent_use[i] == 2){
UNOSclean4$calc_vent_use_bin[i] <- 1
} else if (!is.na(UNOSclean4$calc_vent_use[i]) && !(UNOSclean4$calc_vent_use[i] == 2)){
UNOSclean4$calc_vent_use_bin[i] <- 0
}
}
# Convert indicator to a factor variable
UNOSclean4$calc_vent_use_bin <- as.factor(UNOSclean4$calc_vent_use_bin)
# Recode Cardiac Index
# (indicator: 1 if cardiac index < 2 L/min/m^2, 0 otherwise)
UNOSclean4$ci_bin <- rep(NA, nrow(UNOSclean4)) # Initialize indicator
for (i in 1:nrow(UNOSclean4)){ # Loop over each patient
if (!is.na(UNOSclean4$ci[i]) && UNOSclean4$ci[i] == 2){
UNOSclean4$ci_bin[i] <- 1
} else if (!is.na(UNOSclean4$ci[i]) && !(UNOSclean4$ci[i] == 2)){
UNOSclean4$ci_bin[i] <- 0
}
}
# Convert indicator to a factor variable
UNOSclean4$ci_bin <- as.factor(UNOSclean4$ci_bin)
# Save the cleaned UNOS data
save(UNOSclean4, file="K:/Data/UNOSclean4.RData")Run the appropriate code chunk below to load either the simulated data or the cleaned UNOS data:
# Load the simulated data from GitHub
load(url("https://raw.githubusercontent.com/eschnell/BMIN503_Final_Project/master/simdat.RData"))
# Rename the variables in the simulated data set to match the UNOS data
names(simdat) <- c("ytx", "diag_group", "age", "ci_bin", "calc_vent_use_bin", "end_creat",
"creatinc2_flag", "func_stat_trr_bin", "calc_o2", "calc_six_min_walk",
"ptime", "dx", "dxObstructive", "dxPulmHtx", "dxCF", "dxPulmFib",
"true_prob", "pstatus"
)
# Center the age, six-minute walk, and year variables
simdat$age_cent <- simdat$age - 45.9972602
simdat$walk_cent <- 1200-simdat$calc_six_min_walk
simdat$ytx_cent <- simdat$ytx-2007
UNOSclean4 <- simdat# Load the cleaned UNOS data
load("K:/Data/UNOSclean4.RData")
UNOSclean4$pstatus <- as.numeric(UNOSclean4$pstatus.y)
UNOSclean4$ptime <- UNOSclean4$ptime.yNow subset the data by year:
# Subset the data by year
UNOSclean4 <- as.data.frame(UNOSclean4[order(UNOSclean4$ytx),])
UNOS2007 <- subset(UNOSclean4, ytx==2007)
UNOS2008 <- subset(UNOSclean4, ytx==2008)
UNOS2009 <- subset(UNOSclean4, ytx==2009)
UNOS2010 <- subset(UNOSclean4, ytx==2010)
UNOS2011 <- subset(UNOSclean4, ytx==2011)
UNOS2012 <- subset(UNOSclean4, ytx==2012)
UNOS2013 <- subset(UNOSclean4, ytx==2013)
UNOS2014 <- subset(UNOSclean4, ytx==2014)
UNOS2015 <- subset(UNOSclean4, ytx==2015)
# Create a vector of years for results table
years <- as.data.frame(c(2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015))
names(years) <- "Year"Next, we conduct exploratory analyses to address the following questions:
Mortality risk was estimated by fitting a null (intercept-only) logistic regression model with binary mortality as the outcome (1=dead within one year post-transplant, 0=alive). For comparison, we also display the Kaplan-Meier estimated survival for each year under study.
# Fit null logistic regression model for mortality by year
# pstatus: 1=dead, 0=alive
logitmod_null_2007 <- glm(pstatus ~ 1, data=UNOS2007, family="binomial")
logitmod_null_2008 <- glm(pstatus ~ 1, data=UNOS2008, family="binomial")
logitmod_null_2009 <- glm(pstatus ~ 1, data=UNOS2009, family="binomial")
logitmod_null_2010 <- glm(pstatus ~ 1, data=UNOS2010, family="binomial")
logitmod_null_2011 <- glm(pstatus ~ 1, data=UNOS2011, family="binomial")
logitmod_null_2012 <- glm(pstatus ~ 1, data=UNOS2012, family="binomial")
logitmod_null_2013 <- glm(pstatus ~ 1, data=UNOS2013, family="binomial")
logitmod_null_2014 <- glm(pstatus ~ 1, data=UNOS2014, family="binomial")
logitmod_null_2015 <- glm(pstatus ~ 1, data=UNOS2015, family="binomial")
# Create table of INTERCEPT coefficients
int_only_tab <- rbind(coef(summary(logitmod_null_2007))[1,1:2],
coef(summary(logitmod_null_2008))[1,1:2],
coef(summary(logitmod_null_2009))[1,1:2],
coef(summary(logitmod_null_2010))[1,1:2],
coef(summary(logitmod_null_2011))[1,1:2],
coef(summary(logitmod_null_2012))[1,1:2],
coef(summary(logitmod_null_2013))[1,1:2],
coef(summary(logitmod_null_2014))[1,1:2],
coef(summary(logitmod_null_2015))[1,1:2])
int_only_tab <- cbind(years, int_only_tab)
# Plot the INTERCEPT coefficient over time
p1 <- ggplot(data=int_only_tab, aes(x=Year, y=Estimate)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`,
ymax=Estimate+`Std. Error`),
width=.2) +
geom_point()
# Format the axes
p1 + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015), labels=c("2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")) +
ggtitle("Null Logistic Model: Intercept by Year")# Exponentiate estimates to get intercept on Odds Ratio scale
int_only_tab$expEst <- exp(int_only_tab$Estimate)
int_only_tab$expUpB <- exp(int_only_tab$Estimate + int_only_tab$`Std. Error`)
int_only_tab$expLowB <- exp(int_only_tab$Estimate - int_only_tab$`Std. Error`)
# Plot the INTERCEPT coefficient over time (Odds Ratio Scale)
p1b <- ggplot(data=int_only_tab, aes(x=Year, y=expEst)) +
geom_errorbar(aes(ymin=expLowB, ymax=expUpB), width=.2) +
geom_point()
# Format the axes
p1b + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015), labels=c("2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")) + ylab("Odds Ratio") +
ggtitle("Null Logistic Model: Intercept by Year, Odds Ratio Scale")# Calculate the observed 1-year survival probability at each year
# (Kaplan-Meier Estimate)
obsprob <- survfit(Surv(UNOSclean4$ptime/365.25, UNOSclean4$pstatus) ~ UNOSclean4$ytx, type="kaplan-meier", conf.type="log")
# Construct table of the observed 1-year survival probabilities
nrisk <- summary(obsprob, times=1.0)[3] # Number at risk
nevent <- summary(obsprob, times=1.0)[4] # Number of deaths
ncensor <- summary(obsprob, times=1.0)[5] # Number censored
survprob <- summary(obsprob, times=1.0)[6] # 1-year survival probability
lowCI <- summary(obsprob, times=1.0)[10] # Lower 95% CI, log type
upCI <- summary(obsprob, times=1.0)[11] # Upper 95% CI, log type
survtab <- cbind(years, nrisk, nevent, ncensor, survprob, lowCI, upCI)
names(survtab) <- c("Year", "# Risk", "# Deaths", "# Censored", "S(t)", "Lower 95% CI", "Upper 95% CI")
print(survtab)## Year # Risk # Deaths # Censored S(t) Lower 95% CI Upper 95% CI NA NA NA NA NA
## 1 2007 0 1130 140 0 83 0.10708684 0.8929132 0.008558530 0.008558530 0 0
## 2 2008 0 1147 149 0 83 0.11081499 0.8891850 0.008568116 0.008568116 0 0
## 3 2009 0 1231 165 0 144 0.11304233 0.8869577 0.008299659 0.008299659 0 0
## 4 2010 0 1333 196 0 144 0.12277753 0.8772225 0.008225611 0.008225611 0 0
## 5 2011 0 1351 175 0 212 0.10790259 0.8920974 0.007719990 0.007719990 0 0
## 6 2012 0 1277 159 0 242 0.10255613 0.8974439 0.007724119 0.007724119 0 0
## 7 2013 0 1300 182 0 323 0.11238293 0.8876171 0.007880461 0.007880461 0 0
## 8 2014 0 1248 160 0 430 0.09920483 0.9007952 0.007486749 0.007486749 0 0
## 9 2015 0 1155 199 0 595 0.12478436 0.8752156 0.008366087 0.008366087 0 0
# Plot S(t) estimate for each year
St <- ggplot(data=survtab, aes(x=Year, y=`S(t)`)) +
geom_errorbar(aes(ymin=`Lower 95% CI`,
ymax=`Upper 95% CI`),
width=.2) +
geom_point()
# Format the axes
St + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015), labels=c("2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015")) +
ggtitle("Observed 1-Year Post-Transplant Survival Probability")Prevalence of risk factors was visualized via stacked bar charts or violin plots with boxplots overlaid, as appropriate.
# Sample size, by year
sampsizevec <- c(nrow(UNOS2007), nrow(UNOS2008), nrow(UNOS2009),
nrow(UNOS2010), nrow(UNOS2011), nrow(UNOS2012),
nrow(UNOS2013), nrow(UNOS2014), nrow(UNOS2015))
sampsizetab <- as.data.frame(cbind(years, sampsizevec))
names(sampsizetab) <- c("Year", "SampleSize")
ssbar <- ggplot(data=sampsizetab, aes(x=factor(Year), y=SampleSize)) +
geom_bar(stat="identity")
ssbar + labs(title="Sample Size, by Year",
x = "Year", y = "Sample Size")# Sample size by disease category, by year
numdx2007 <- c(summary(UNOS2007$dx)[1], summary(UNOS2007$dx)[2], summary(UNOS2007$dx)[3], summary(UNOS2007$dx)[4])
numdx2008 <- c(summary(UNOS2008$dx)[1], summary(UNOS2008$dx)[2], summary(UNOS2008$dx)[3], summary(UNOS2008$dx)[4])
numdx2009 <- c(summary(UNOS2009$dx)[1], summary(UNOS2009$dx)[2], summary(UNOS2009$dx)[3], summary(UNOS2009$dx)[4])
numdx2010 <- c(summary(UNOS2010$dx)[1], summary(UNOS2010$dx)[2], summary(UNOS2010$dx)[3], summary(UNOS2010$dx)[4])
numdx2011 <- c(summary(UNOS2011$dx)[1], summary(UNOS2011$dx)[2], summary(UNOS2011$dx)[3], summary(UNOS2011$dx)[4])
numdx2012 <- c(summary(UNOS2012$dx)[1], summary(UNOS2012$dx)[2], summary(UNOS2012$dx)[3], summary(UNOS2012$dx)[4])
numdx2013 <- c(summary(UNOS2013$dx)[1], summary(UNOS2013$dx)[2], summary(UNOS2013$dx)[3], summary(UNOS2013$dx)[4])
numdx2014 <- c(summary(UNOS2014$dx)[1], summary(UNOS2014$dx)[2], summary(UNOS2014$dx)[3], summary(UNOS2014$dx)[4])
numdx2015 <- c(summary(UNOS2015$dx)[1], summary(UNOS2015$dx)[2], summary(UNOS2015$dx)[3], summary(UNOS2015$dx)[4])
numdxtab <- as.data.frame(rbind(numdx2007, numdx2008, numdx2009,
numdx2010, numdx2011, numdx2012,
numdx2013, numdx2014, numdx2015))
numdxtab <- as.data.frame(cbind(years, numdxtab))
numdxtab$Year <- factor(numdxtab$Year)
# Convert into long format
library(tidyr)
dxtablong <- gather(numdxtab, dx, count,
`Obstructive dx`:`Pulm fibrosis`,
factor_key=TRUE)
# Sort the data by year and disease category
dxtablong2 <- plyr::arrange(dxtablong, Year, dx)
# Compute the cumulative sum
dxtablong2 <- plyr::ddply(dxtablong2, "Year", transform, label_y = cumsum(count))
# Stacked bar chart (frequency)
numdxbar <- ggplot(data=dxtablong2, aes(x=Year, y=count, fill=dx)) +
geom_bar(stat="identity")
numdxbar + labs(title="Sample size by primary disease category (frequency)", x = "Year", y = "Frequency")# Stacked bar chart (percents)
# Do a group-wise transform by year and compute percents
pctdxtab <- plyr::ddply(dxtablong, "Year", transform,
pctdx = (count/sum(count))*100)
pctdxbar <- ggplot(data=pctdxtab, aes(x=Year, y=pctdx, fill=dx)) +
geom_bar(stat="identity")
pctdxbar + labs(title="Sample size by primary disease category (percent)", x = "Year", y = "Percent")As one can see, the number of transplants increase as year increases. Additionally, the proportion of pulmonary fibrosis patients increases over time.
Next, we present violin plots of the key risk factors identified above, separately by year. Boxplots are overlaid on top of the violin plots to facilitate comparisons across years.
# Violin plot with boxplot overlay for key LAS variables, separately by year
# Age (age)
ggplot(data=UNOSclean4, aes(x=factor(ytx), y=age)) +
geom_violin(fill="lightblue") +
geom_boxplot(width=0.1, outlier.color=NA) +
xlab("Year") +
ylab("Age") +
ggtitle("Age over Time")# Sample size by cardiac index < 2 L/min/m^2 (dichotomous), by year
numci2007 <- c(summary(UNOS2007$ci_bin)[1],
summary(UNOS2007$ci_bin)[2])
numci2008 <- c(summary(UNOS2008$ci_bin)[1],
summary(UNOS2008$ci_bin)[2])
numci2009 <- c(summary(UNOS2009$ci_bin)[1],
summary(UNOS2009$ci_bin)[2])
numci2010 <- c(summary(UNOS2010$ci_bin)[1],
summary(UNOS2010$ci_bin)[2])
numci2011 <- c(summary(UNOS2011$ci_bin)[1],
summary(UNOS2011$ci_bin)[2])
numci2012 <- c(summary(UNOS2012$ci_bin)[1],
summary(UNOS2012$ci_bin)[2])
numci2013 <- c(summary(UNOS2013$ci_bin)[1],
summary(UNOS2013$ci_bin)[2])
numci2014 <- c(summary(UNOS2014$ci_bin)[1],
summary(UNOS2014$ci_bin)[2])
numci2015 <- c(summary(UNOS2015$ci_bin)[1],
summary(UNOS2015$ci_bin)[2])
numcitab <- as.data.frame(rbind(numci2007, numci2008, numci2009,
numci2010, numci2011, numci2012,
numci2013, numci2014, numci2015))
numcitab <- as.data.frame(cbind(years, numcitab))
numcitab$Year <- factor(numcitab$Year)
# Convert into long format
citablong <- gather(numcitab, ci_cat, count,
`0`:`1`,
factor_key=TRUE)
# Sort the data by year and disease category
citablong2 <- plyr::arrange(citablong, Year, ci_cat)
# Compute the cumulative sum
citablong2 <- plyr::ddply(citablong2, "Year", transform, label_y = cumsum(count))
# Stacked bar chart (frequency)
numcibar <- ggplot(data=citablong2, aes(x=Year, y=count,
fill=ci_cat)) +
geom_bar(stat="identity")
numcibar + labs(title="Sample size by cardiac index indicator (frequency)", x = "Year", y = "Frequency") +
scale_fill_discrete(labels=c(expression("Cardiac Index >= 2 L/min/" ~ m^{2}), expression("Cardiac Index < 2 L/min/" ~ m^{2})))# Stacked bar chart (percents)
# Do a group-wise transform by year and compute percents
pctcitab <- plyr::ddply(citablong, "Year", transform,
pctci = (count/sum(count))*100)
pctcibar <- ggplot(data=pctcitab, aes(x=Year, y=pctci,
fill=ci_cat)) +
geom_bar(stat="identity")
pctcibar + labs(title="Sample size by cardiac index indicator (percent)", x = "Year", y = "Percent") +
scale_fill_discrete(labels=c(expression("Cardiac Index >= 2 L/min/" ~ m^{2}), expression("Cardiac Index < 2 L/min/" ~ m^{2})))# Sample size by mechanical ventilation (dichotomous), by year
numvent2007 <- c(summary(UNOS2007$calc_vent_use_bin)[1],
summary(UNOS2007$calc_vent_use_bin)[2])
numvent2008 <- c(summary(UNOS2008$calc_vent_use_bin)[1],
summary(UNOS2008$calc_vent_use_bin)[2])
numvent2009 <- c(summary(UNOS2009$calc_vent_use_bin)[1],
summary(UNOS2009$calc_vent_use_bin)[2])
numvent2010 <- c(summary(UNOS2010$calc_vent_use_bin)[1],
summary(UNOS2010$calc_vent_use_bin)[2])
numvent2011 <- c(summary(UNOS2011$calc_vent_use_bin)[1],
summary(UNOS2011$calc_vent_use_bin)[2])
numvent2012 <- c(summary(UNOS2012$calc_vent_use_bin)[1],
summary(UNOS2012$calc_vent_use_bin)[2])
numvent2013 <- c(summary(UNOS2013$calc_vent_use_bin)[1],
summary(UNOS2013$calc_vent_use_bin)[2])
numvent2014 <- c(summary(UNOS2014$calc_vent_use_bin)[1],
summary(UNOS2014$calc_vent_use_bin)[2])
numvent2015 <- c(summary(UNOS2015$calc_vent_use_bin)[1],
summary(UNOS2015$calc_vent_use_bin)[2])
numventtab <- as.data.frame(rbind(numvent2007, numvent2008, numvent2009,
numvent2010, numvent2011, numvent2012,
numvent2013, numvent2014, numvent2015))
numventtab <- as.data.frame(cbind(years, numventtab))
numventtab$Year <- factor(numventtab$Year)
# Convert into long format
venttablong <- gather(numventtab, vent_cat, count,
`0`:`1`,
factor_key=TRUE)
# Sort the data by year and disease category
venttablong2 <- plyr::arrange(venttablong, Year, vent_cat)
# Compute the cumulative sum
venttablong2 <- plyr::ddply(venttablong2, "Year", transform, label_y = cumsum(count))
# Stacked bar chart (frequency)
numventbar <- ggplot(data=venttablong2, aes(x=Year, y=count,
fill=vent_cat)) +
geom_bar(stat="identity")
numventbar + labs(title="Sample size by mechanical ventilation indicator (frequency)", x = "Year", y = "Frequency") +
scale_fill_discrete(labels=c("Not on continuous mechanical ventilation", "On continuous mechanical ventilation"))# Stacked bar chart (percents)
# Do a group-wise transform by year and compute percents
pctventtab <- plyr::ddply(venttablong, "Year", transform,
pctvent = (count/sum(count))*100)
pctventbar <- ggplot(data=pctventtab, aes(x=Year, y=pctvent,
fill=vent_cat)) +
geom_bar(stat="identity")
pctventbar + labs(title="Sample size by mechanical ventilation indicator (percent)", x = "Year", y = "Percent") +
scale_fill_discrete(labels=c("Not on continuous mechanical ventilation", "On continuous mechanical ventilation"))# Creatinine (END_CREAT)
ggplot(data=UNOSclean4, aes(x=factor(ytx), y=end_creat)) +
geom_violin(fill="lightblue") +
geom_boxplot(width=0.1, outlier.color=NA) +
xlab("Year") +
ylab("Creatinine (mg/dL)") +
ggtitle("Creatinine at transplant (END_CREAT) over Time")# Creatinine increase >= 150%
# Sample size by creatinine increase indicator, by year
numcii2007 <- c(summary(UNOS2007$creatinc2_flag)[1],
summary(UNOS2007$creatinc2_flag)[2])
numcii2008 <- c(summary(UNOS2008$creatinc2_flag)[1],
summary(UNOS2008$creatinc2_flag)[2])
numcii2009 <- c(summary(UNOS2009$creatinc2_flag)[1],
summary(UNOS2009$creatinc2_flag)[2])
numcii2010 <- c(summary(UNOS2010$creatinc2_flag)[1],
summary(UNOS2010$creatinc2_flag)[2])
numcii2011 <- c(summary(UNOS2011$creatinc2_flag)[1],
summary(UNOS2011$creatinc2_flag)[2])
numcii2012 <- c(summary(UNOS2012$creatinc2_flag)[1],
summary(UNOS2012$creatinc2_flag)[2])
numcii2013 <- c(summary(UNOS2013$creatinc2_flag)[1],
summary(UNOS2013$creatinc2_flag)[2])
numcii2014 <- c(summary(UNOS2014$creatinc2_flag)[1],
summary(UNOS2014$creatinc2_flag)[2])
numcii2015 <- c(summary(UNOS2015$creatinc2_flag)[1],
summary(UNOS2015$creatinc2_flag)[2])
numciitab <- as.data.frame(rbind(numcii2007, numcii2008, numcii2009,
numcii2010, numcii2011, numcii2012,
numcii2013, numcii2014, numcii2015))
numciitab <- as.data.frame(cbind(years, numciitab))
numciitab$Year <- factor(numciitab$Year)
# Convert into long format
ciitablong <- gather(numciitab, cii_cat, count,
`0`:`1`,
factor_key=TRUE)
# Sort the data by year and creatinine increase indicator
ciitablong2 <- plyr::arrange(ciitablong, Year, cii_cat)
# Compute the cumulative sum
ciitablong2 <- plyr::ddply(ciitablong2, "Year", transform, label_y = cumsum(count))
# Stacked bar chart (frequency)
numciibar <- ggplot(data=ciitablong2, aes(x=Year, y=count,
fill=cii_cat)) +
geom_bar(stat="identity")
numciibar + labs(title="Sample size by creatinine increase indicator (frequency)", x = "Year", y = "Frequency") +
scale_fill_discrete(labels=c("< 150%", ">= 150%"))# Stacked bar chart (percents)
# Do a group-wise transform by year and compute percents
pctciitab <- plyr::ddply(ciitablong, "Year", transform,
pctcii = (count/sum(count))*100)
pctciibar <- ggplot(data=pctciitab, aes(x=Year, y=pctcii,
fill=cii_cat)) +
geom_bar(stat="identity")
pctciibar + labs(title="Sample size by creatinine increase indicator (percent)", x = "Year", y = "Percent") +
scale_fill_discrete(labels=c("< 150%", ">= 150%"))# Sample size by functional status (dichotomous), by year
numfst2007 <- c(summary(UNOS2007$func_stat_trr_bin)[1],
summary(UNOS2007$func_stat_trr_bin)[2])
numfst2008 <- c(summary(UNOS2008$func_stat_trr_bin)[1],
summary(UNOS2008$func_stat_trr_bin)[2])
numfst2009 <- c(summary(UNOS2009$func_stat_trr_bin)[1],
summary(UNOS2009$func_stat_trr_bin)[2])
numfst2010 <- c(summary(UNOS2010$func_stat_trr_bin)[1],
summary(UNOS2010$func_stat_trr_bin)[2])
numfst2011 <- c(summary(UNOS2011$func_stat_trr_bin)[1],
summary(UNOS2011$func_stat_trr_bin)[2])
numfst2012 <- c(summary(UNOS2012$func_stat_trr_bin)[1],
summary(UNOS2012$func_stat_trr_bin)[2])
numfst2013 <- c(summary(UNOS2013$func_stat_trr_bin)[1],
summary(UNOS2013$func_stat_trr_bin)[2])
numfst2014 <- c(summary(UNOS2014$func_stat_trr_bin)[1],
summary(UNOS2014$func_stat_trr_bin)[2])
numfst2015 <- c(summary(UNOS2015$func_stat_trr_bin)[1],
summary(UNOS2015$func_stat_trr_bin)[2])
numfsttab <- as.data.frame(rbind(numfst2007, numfst2008, numfst2009,
numfst2010, numfst2011, numfst2012,
numfst2013, numfst2014, numfst2015))
numfsttab <- as.data.frame(cbind(years, numfsttab))
numfsttab$Year <- factor(numfsttab$Year)
# Convert into long format
fsttablong <- gather(numfsttab, fst_cat, count,
`0`:`1`,
factor_key=TRUE)
# Sort the data by year and disease category
fsttablong2 <- plyr::arrange(fsttablong, Year, fst_cat)
# Compute the cumulative sum
fsttablong2 <- plyr::ddply(fsttablong2, "Year", transform, label_y = cumsum(count))
# Stacked bar chart (frequency)
numfstbar <- ggplot(data=fsttablong2, aes(x=Year, y=count,
fill=fst_cat)) +
geom_bar(stat="identity")
numfstbar + labs(title="Sample size by functional status indicator (frequency)", x = "Year", y = "Frequency") +
scale_fill_discrete(labels=c("< 2050", ">= 2050"))# Stacked bar chart (percents)
# Do a group-wise transform by year and compute percents
pctfsttab <- plyr::ddply(fsttablong, "Year", transform,
pctfst = (count/sum(count))*100)
pctfstbar <- ggplot(data=pctfsttab, aes(x=Year, y=pctfst,
fill=fst_cat)) +
geom_bar(stat="identity")
pctfstbar + labs(title="Sample size by functional status indicator (percent)", x = "Year", y = "Percent") +
scale_fill_discrete(labels=c("< 2050: Disabled", ">= 2050: Normal Function"))# Oxygen Requirement (calc_o2)
ggplot(data=UNOSclean4, aes(x=factor(ytx), y=calc_o2)) +
geom_violin(fill="lightblue") +
geom_boxplot(width=0.1, outlier.color=NA) +
xlab("Year") +
ylab("Oxygen Requirement (L/min)") +
ggtitle("Oxygen Requirement over Time")# Six Minute Walk Distance (calc_six_min_walk)
ggplot(data=UNOSclean4, aes(x=factor(ytx), y=calc_six_min_walk)) +
geom_violin(fill="lightblue") +
geom_boxplot(width=0.1, outlier.color=NA) +
xlab("Year") +
ylab("Six Minute Walk Distance (feet)") +
ggtitle("Six Minute Walk Distance over Time")Changes in LAS coefficient estimates were examined by fitting adjusted logistic regression models with binary mortality as the outcome and the covariates included in the current LAS model as predictors. All models also include variable-by-year interaction terms, with 2007 being treated as the reference year. In each case, we first identify which interaction terms are significant, and then plot the actual coefficient estimates for these significant variables over time. Coefficient estimates for variables with non-significant interactions with time are not plotted.
# Test for variable-by-time interactions using the multivariable logistic regression model
# Age-by-year
logitmod_mv_ageyrint <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag +
+ dx + func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent + age_cent*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_ageyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + +dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + age_cent * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3248 -0.9994 -0.7166 1.1070 1.9983
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.184e-03 8.363e-02 0.074 0.9411
## age_cent 3.493e-02 2.655e-03 13.159 < 2e-16 ***
## ci_bin1 -7.471e-02 1.624e-01 -0.460 0.6456
## calc_vent_use_bin1 4.939e-01 7.599e-02 6.499 8.07e-11 ***
## end_creat 5.128e-01 6.911e-02 7.420 1.17e-13 ***
## creatinc2_flag1 1.767e-01 3.326e-01 0.531 0.5952
## dxPulm Htx 1.733e-01 1.003e-01 1.729 0.0839 .
## dxCF 1.498e-01 6.096e-02 2.457 0.0140 *
## dxPulm fibrosis 1.645e-01 3.982e-02 4.130 3.62e-05 ***
## func_stat_trr_bin1 -1.989e-01 3.625e-02 -5.487 4.08e-08 ***
## calc_o2 -4.787e-03 3.109e-03 -1.540 0.1237
## walk_cent 3.153e-04 4.231e-05 7.452 9.17e-14 ***
## ytx_cent -2.158e-01 8.614e-03 -25.048 < 2e-16 ***
## age_cent:ytx_cent -3.829e-03 5.489e-04 -6.975 3.06e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18883 on 14939 degrees of freedom
## AIC: 18911
##
## Number of Fisher Scoring iterations: 4
# Cardiac index-by-year
logitmod_mv_ciyrint <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag +
+ dx + func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent + ci_bin*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_ciyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + +dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + ci_bin * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1186 -1.0143 -0.7017 1.0945 2.1970
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.369e-01 8.142e-02 1.681 0.0928 .
## age_cent 1.913e-02 1.344e-03 14.230 < 2e-16 ***
## ci_bin1 -2.485e-02 3.313e-01 -0.075 0.9402
## calc_vent_use_bin1 5.013e-01 7.602e-02 6.594 4.27e-11 ***
## end_creat 5.034e-01 6.898e-02 7.297 2.93e-13 ***
## creatinc2_flag1 1.760e-01 3.309e-01 0.532 0.5948
## dxPulm Htx 1.639e-01 1.001e-01 1.638 0.1015
## dxCF 1.532e-01 6.085e-02 2.518 0.0118 *
## dxPulm fibrosis 1.638e-01 3.974e-02 4.121 3.78e-05 ***
## func_stat_trr_bin1 -1.961e-01 3.620e-02 -5.416 6.09e-08 ***
## calc_o2 -4.724e-03 3.107e-03 -1.521 0.1284
## walk_cent 3.118e-04 4.223e-05 7.384 1.54e-13 ***
## ytx_cent -2.489e-01 7.324e-03 -33.988 < 2e-16 ***
## ci_bin1:ytx_cent -1.201e-02 6.656e-02 -0.180 0.8569
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18932 on 14939 degrees of freedom
## AIC: 18960
##
## Number of Fisher Scoring iterations: 4
# Mechanical ventilation-by-year
logitmod_mv_ventyrint <- glm(pstatus ~ age_cent + ci_bin +
calc_vent_use_bin +
end_creat + creatinc2_flag + dx + func_stat_trr_bin +
calc_o2 + walk_cent + ytx_cent +
calc_vent_use_bin*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_ventyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + calc_vent_use_bin * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9784 -1.0155 -0.6977 1.0913 2.1853
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.537e-01 8.164e-02 1.882 0.0598 .
## age_cent 1.917e-02 1.345e-03 14.257 < 2e-16 ***
## ci_bin1 -7.788e-02 1.623e-01 -0.480 0.6314
## calc_vent_use_bin1 1.566e-01 1.601e-01 0.978 0.3281
## end_creat 5.035e-01 6.899e-02 7.298 2.91e-13 ***
## creatinc2_flag1 1.771e-01 3.309e-01 0.535 0.5925
## dxPulm Htx 1.632e-01 1.001e-01 1.630 0.1030
## dxCF 1.501e-01 6.087e-02 2.467 0.0136 *
## dxPulm fibrosis 1.628e-01 3.974e-02 4.097 4.19e-05 ***
## func_stat_trr_bin1 -1.975e-01 3.622e-02 -5.452 4.97e-08 ***
## calc_o2 -4.738e-03 3.108e-03 -1.524 0.1275
## walk_cent 3.127e-04 4.223e-05 7.405 1.31e-13 ***
## ytx_cent -2.527e-01 7.456e-03 -33.897 < 2e-16 ***
## calc_vent_use_bin1:ytx_cent 8.106e-02 3.351e-02 2.419 0.0156 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18926 on 14939 degrees of freedom
## AIC: 18954
##
## Number of Fisher Scoring iterations: 4
# Creatinine-by-year
logitmod_mv_creatyrint <- glm(pstatus ~ age_cent + ci_bin +
calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin +
calc_o2 + walk_cent + ytx_cent +
end_creat*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_creatyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + end_creat * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1166 -1.0152 -0.7003 1.0936 2.1885
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.747e-01 1.286e-01 1.358 0.17439
## age_cent 1.912e-02 1.344e-03 14.224 < 2e-16 ***
## ci_bin1 -7.769e-02 1.624e-01 -0.479 0.63228
## calc_vent_use_bin1 5.013e-01 7.602e-02 6.594 4.28e-11 ***
## end_creat 4.594e-01 1.363e-01 3.371 0.00075 ***
## creatinc2_flag1 1.753e-01 3.310e-01 0.530 0.59631
## dxPulm Htx 1.645e-01 1.001e-01 1.643 0.10041
## dxCF 1.531e-01 6.085e-02 2.516 0.01187 *
## dxPulm fibrosis 1.637e-01 3.974e-02 4.120 3.80e-05 ***
## func_stat_trr_bin1 -1.961e-01 3.621e-02 -5.416 6.11e-08 ***
## calc_o2 -4.732e-03 3.107e-03 -1.523 0.12774
## walk_cent 3.121e-04 4.223e-05 7.391 1.45e-13 ***
## ytx_cent -2.579e-01 2.478e-02 -10.405 < 2e-16 ***
## end_creat:ytx_cent 1.048e-02 2.808e-02 0.373 0.70911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18932 on 14939 degrees of freedom
## AIC: 18960
##
## Number of Fisher Scoring iterations: 4
# Creatinine increase-by-year
logitmod_mv_creatincyrint <- glm(pstatus ~ age_cent + ci_bin +
calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent +
creatinc2_flag*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_creatincyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + creatinc2_flag * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1200 -1.0146 -0.7012 1.0944 2.1799
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.401e-01 8.136e-02 1.722 0.0850 .
## age_cent 1.912e-02 1.344e-03 14.226 < 2e-16 ***
## ci_bin1 -7.678e-02 1.624e-01 -0.473 0.6363
## calc_vent_use_bin1 5.012e-01 7.603e-02 6.593 4.32e-11 ***
## end_creat 5.028e-01 6.898e-02 7.290 3.11e-13 ***
## creatinc2_flag1 -5.346e-01 5.519e-01 -0.969 0.3327
## dxPulm Htx 1.633e-01 1.001e-01 1.632 0.1027
## dxCF 1.531e-01 6.086e-02 2.516 0.0119 *
## dxPulm fibrosis 1.643e-01 3.974e-02 4.135 3.55e-05 ***
## func_stat_trr_bin1 -1.959e-01 3.621e-02 -5.411 6.25e-08 ***
## calc_o2 -4.731e-03 3.107e-03 -1.523 0.1278
## walk_cent 3.118e-04 4.223e-05 7.384 1.54e-13 ***
## ytx_cent -2.497e-01 7.297e-03 -34.217 < 2e-16 ***
## creatinc2_flag1:ytx_cent 1.942e-01 1.233e-01 1.576 0.1151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18929 on 14939 degrees of freedom
## AIC: 18957
##
## Number of Fisher Scoring iterations: 4
# Diagnosis-by-year
logitmod_mv_dxyrint <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent +
dx*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_dxyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + dx * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.104 -1.015 -0.698 1.095 2.159
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.982e-02 9.130e-02 0.984 0.32520
## age_cent 1.911e-02 1.345e-03 14.211 < 2e-16 ***
## ci_bin1 -7.635e-02 1.624e-01 -0.470 0.63818
## calc_vent_use_bin1 5.065e-01 7.610e-02 6.656 2.82e-11 ***
## end_creat 5.033e-01 6.901e-02 7.294 3.02e-13 ***
## creatinc2_flag1 1.822e-01 3.316e-01 0.550 0.58264
## dxPulm Htx 3.330e-01 2.005e-01 1.661 0.09675 .
## dxCF 4.928e-01 1.202e-01 4.099 4.14e-05 ***
## dxPulm fibrosis 1.683e-01 7.521e-02 2.238 0.02521 *
## func_stat_trr_bin1 -1.933e-01 3.623e-02 -5.335 9.54e-08 ***
## calc_o2 -4.636e-03 3.108e-03 -1.492 0.13582
## walk_cent 3.125e-04 4.225e-05 7.397 1.39e-13 ***
## ytx_cent -2.374e-01 1.287e-02 -18.445 < 2e-16 ***
## dxPulm Htx:ytx_cent -4.205e-02 4.267e-02 -0.985 0.32439
## dxCF:ytx_cent -8.368e-02 2.559e-02 -3.270 0.00108 **
## dxPulm fibrosis:ytx_cent -2.420e-03 1.587e-02 -0.153 0.87876
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18918 on 14937 degrees of freedom
## AIC: 18950
##
## Number of Fisher Scoring iterations: 4
# Functional status-by-year
logitmod_mv_funcstatyrint <- glm(pstatus ~ age_cent + ci_bin +
calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent +
func_stat_trr_bin*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_funcstatyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + func_stat_trr_bin * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1030 -1.0126 -0.6998 1.0956 2.2065
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.241e-01 9.368e-02 2.392 0.0168 *
## age_cent 1.917e-02 1.345e-03 14.254 < 2e-16 ***
## ci_bin1 -7.813e-02 1.624e-01 -0.481 0.6304
## calc_vent_use_bin1 4.984e-01 7.602e-02 6.557 5.50e-11 ***
## end_creat 5.044e-01 6.899e-02 7.311 2.65e-13 ***
## creatinc2_flag1 1.797e-01 3.312e-01 0.543 0.5874
## dxPulm Htx 1.628e-01 1.001e-01 1.626 0.1040
## dxCF 1.509e-01 6.087e-02 2.479 0.0132 *
## dxPulm fibrosis 1.642e-01 3.974e-02 4.130 3.62e-05 ***
## func_stat_trr_bin1 -3.203e-01 7.564e-02 -4.235 2.29e-05 ***
## calc_o2 -4.811e-03 3.107e-03 -1.548 0.1216
## walk_cent 3.124e-04 4.223e-05 7.398 1.38e-13 ***
## ytx_cent -2.678e-01 1.244e-02 -21.526 < 2e-16 ***
## func_stat_trr_bin1:ytx_cent 2.831e-02 1.512e-02 1.873 0.0611 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18928 on 14939 degrees of freedom
## AIC: 18956
##
## Number of Fisher Scoring iterations: 4
# o2-by-year
logitmod_mv_o2yrint <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent +
calc_o2*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_o2yrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + calc_o2 * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1224 -1.0146 -0.6999 1.0932 2.1839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.118e-01 9.187e-02 1.217 0.2237
## age_cent 1.912e-02 1.344e-03 14.225 < 2e-16 ***
## ci_bin1 -7.669e-02 1.623e-01 -0.472 0.6367
## calc_vent_use_bin1 5.009e-01 7.602e-02 6.589 4.43e-11 ***
## end_creat 5.037e-01 6.898e-02 7.302 2.84e-13 ***
## creatinc2_flag1 1.762e-01 3.310e-01 0.532 0.5944
## dxPulm Htx 1.641e-01 1.001e-01 1.640 0.1011
## dxCF 1.534e-01 6.085e-02 2.521 0.0117 *
## dxPulm fibrosis 1.638e-01 3.974e-02 4.122 3.75e-05 ***
## func_stat_trr_bin1 -1.957e-01 3.621e-02 -5.404 6.53e-08 ***
## calc_o2 -9.256e-04 7.034e-03 -0.132 0.8953
## walk_cent 3.119e-04 4.222e-05 7.388 1.49e-13 ***
## ytx_cent -2.435e-01 1.174e-02 -20.738 < 2e-16 ***
## calc_o2:ytx_cent -7.920e-04 1.316e-03 -0.602 0.5472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18931 on 14939 degrees of freedom
## AIC: 18959
##
## Number of Fisher Scoring iterations: 4
# Six minute walk-by-year
logitmod_mv_walkyrint <- glm(pstatus ~ age_cent + ci_bin +
calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
walk_cent + ytx_cent +
walk_cent*ytx_cent,
data=UNOSclean4, family="binomial")
summary(logitmod_mv_walkyrint)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## walk_cent + ytx_cent + walk_cent * ytx_cent, family = "binomial",
## data = UNOSclean4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0971 -1.0154 -0.6974 1.0935 2.2014
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.783e-01 8.722e-02 2.044 0.04093 *
## age_cent 1.912e-02 1.344e-03 14.222 < 2e-16 ***
## ci_bin1 -7.387e-02 1.624e-01 -0.455 0.64930
## calc_vent_use_bin1 5.021e-01 7.605e-02 6.603 4.03e-11 ***
## end_creat 5.044e-01 6.898e-02 7.312 2.64e-13 ***
## creatinc2_flag1 1.735e-01 3.313e-01 0.524 0.60037
## dxPulm Htx 1.644e-01 1.001e-01 1.642 0.10057
## dxCF 1.533e-01 6.085e-02 2.520 0.01175 *
## dxPulm fibrosis 1.642e-01 3.974e-02 4.131 3.61e-05 ***
## func_stat_trr_bin1 -1.960e-01 3.621e-02 -5.414 6.17e-08 ***
## calc_o2 -4.727e-03 3.107e-03 -1.521 0.12815
## walk_cent 2.200e-04 8.233e-05 2.673 0.00753 **
## ytx_cent -2.594e-01 1.079e-02 -24.029 < 2e-16 ***
## walk_cent:ytx_cent 2.222e-05 1.711e-05 1.299 0.19402
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20593 on 14952 degrees of freedom
## Residual deviance: 18930 on 14939 degrees of freedom
## AIC: 18958
##
## Number of Fisher Scoring iterations: 4
# Create a table of the variable-by-time interaction terms
intnum <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
intname <- c("age * year",
"cardiac index * year",
"mechanical ventilation * year",
"creatinine * year",
"creatinine increase * year",
"pulmonary hypertension * year",
"cystic fibrosis * year",
"pulmonary fibrosis * year",
"functional status * year",
"oxygen need * year",
"six-minute-walk * year")
inttab2 <- as.data.frame(rbind(coef(summary(logitmod_mv_ageyrint))[14,],
coef(summary(logitmod_mv_ciyrint))[14,],
coef(summary(logitmod_mv_ventyrint))[14,],
coef(summary(logitmod_mv_creatyrint))[14,],
coef(summary(logitmod_mv_creatincyrint))[14,],
coef(summary(logitmod_mv_dxyrint))[14,],
coef(summary(logitmod_mv_dxyrint))[15,],
coef(summary(logitmod_mv_dxyrint))[16,],
coef(summary(logitmod_mv_funcstatyrint))[14,],
coef(summary(logitmod_mv_o2yrint))[14,],
coef(summary(logitmod_mv_walkyrint))[14,]))
inttab2 <- cbind(intnum, intname, inttab2)
names(inttab2) <- c("TermNum", "Interaction Term", "Estimate", "Std. Error", "z value", "Pr(>|z|)")
print(inttab2)## TermNum Interaction Term Estimate Std. Error z value Pr(>|z|)
## 1 1 age * year -3.828782e-03 5.489437e-04 -6.9748181 3.062668e-12
## 2 2 cardiac index * year -1.200571e-02 6.656463e-02 -0.1803617 8.568686e-01
## 3 3 mechanical ventilation * year 8.105943e-02 3.351018e-02 2.4189497 1.556539e-02
## 4 4 creatinine * year 1.047640e-02 2.808285e-02 0.3730534 7.091087e-01
## 5 5 creatinine increase * year 1.942235e-01 1.232568e-01 1.5757631 1.150804e-01
## 6 6 pulmonary hypertension * year -4.204892e-02 4.266870e-02 -0.9854745 3.243911e-01
## 7 7 cystic fibrosis * year -8.367537e-02 2.558970e-02 -3.2698842 1.075915e-03
## 8 8 pulmonary fibrosis * year -2.420498e-03 1.586789e-02 -0.1525407 8.787605e-01
## 9 9 functional status * year 2.831083e-02 1.511876e-02 1.8725632 6.112873e-02
## 10 10 oxygen need * year -7.919842e-04 1.315812e-03 -0.6018976 5.472423e-01
## 11 11 six-minute-walk * year 2.222277e-05 1.711039e-05 1.2987878 1.940168e-01
# Plot the p-values for the variable-by-time interaction terms (with reference line at p=0.1)
pvalint2 <- ggplot(data=inttab2, aes(x=TermNum, y=`Pr(>|z|)`)) + geom_point(size=2.5) +
geom_hline(yintercept=0.1, color="red")
# Format the axes
pvalint2 + scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
labels=c("Age", "Cardiac Index",
"Mechanical Ventilation", "Creatinine",
"Creatinine Increase >= 150%",
"Pulmonary Hypertension",
"Cystic Fibrosis",
"Pulmonary Fibrosis",
"Functional Status",
"Oxygen Requirement",
"Six minute walk distance")) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("Interaction Term") + ylab("p-value") +
ggtitle("Multivariable Logistic Model: p-values for variable-by-time \ninteraction terms, after adjusting for other LAS variables")# Also plot the p-values on log scale to distinguish between the low p-values
pvalint3 <- ggplot(data=inttab2, aes(x=TermNum, y=`Pr(>|z|)`)) + geom_point(size=2.5) +
geom_hline(yintercept=0.1, color="red") + scale_y_log10()
# Format the axes
pvalint3 + scale_x_continuous(breaks=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
labels=c("Age", "Cardiac Index",
"Mechanical Ventilation", "Creatinine",
"Creatinine Increase >= 150%",
"Pulmonary Hypertension",
"Cystic Fibrosis",
"Pulmonary Fibrosis",
"Functional Status",
"Oxygen Requirement",
"Six minute walk distance")) +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
xlab("Interaction Term") + ylab("p-value") +
ggtitle("Multivariable Logistic Model: p-values for variable-by-time interaction terms, \nafter adjusting for other LAS variables (on log scale)")The above plots suggest that after adjusting for the other variables in the LAS, the age-by-year, mechanical ventilation-by-year, and cystic fibrosis-by-year interaction terms are statistically significant at the \(\alpha\)=0.1 significance level, while the pulmonary hypertension-by-year interaction term is marginally statistically significant. Furthermore, when we plot these p-values on the log scale, we see that the p-value for the age-by-year interaction term is several orders of magnitude smaller than those for the other significant interaction terms.
Next, we plot the coefficient estimates for these significant interaction terms over time, so that we can visualize how the actual coefficent values change.
# Multivariable logistic regression model by year
logitmod_mv_2007 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2007,
family="binomial")
summary(logitmod_mv_2007)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2007)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1760 -1.2423 0.7244 0.8911 1.4790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.584385 0.308127 1.897 0.0579 .
## age_cent 0.037083 0.004797 7.731 1.07e-14 ***
## ci_bin1 0.661516 0.580570 1.139 0.2545
## calc_vent_use_bin1 -0.175393 0.296589 -0.591 0.5543
## end_creat 0.528868 0.244655 2.162 0.0306 *
## creatinc2_flag1 -0.083044 1.434049 -0.058 0.9538
## dxPulm Htx 0.241879 0.358246 0.675 0.4996
## dxCF 0.228473 0.208210 1.097 0.2725
## dxPulm fibrosis 0.049907 0.131623 0.379 0.7046
## func_stat_trr_bin1 -0.317834 0.150714 -2.109 0.0350 *
## calc_o2 -0.012131 0.015512 -0.782 0.4342
## calc_six_min_walk -0.000361 0.000145 -2.489 0.0128 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1702.3 on 1352 degrees of freedom
## Residual deviance: 1619.2 on 1341 degrees of freedom
## AIC: 1643.2
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2008 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2008,
family="binomial")
summary(logitmod_mv_2008)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2008)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1973 -1.2262 0.7469 0.9579 1.4701
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3895481 0.2816665 1.383 0.16666
## age_cent 0.0335442 0.0045468 7.378 1.61e-13 ***
## ci_bin1 -0.0416717 0.9253413 -0.045 0.96408
## calc_vent_use_bin1 0.3401964 0.2715639 1.253 0.21030
## end_creat 0.2853292 0.2280878 1.251 0.21095
## creatinc2_flag1 0.7524742 1.1258115 0.668 0.50389
## dxPulm Htx 1.0915564 0.4365885 2.500 0.01241 *
## dxCF 0.6436694 0.2092384 3.076 0.00210 **
## dxPulm fibrosis 0.2233890 0.1252620 1.783 0.07453 .
## func_stat_trr_bin1 -0.3767612 0.1375615 -2.739 0.00617 **
## calc_o2 -0.0211106 0.0118229 -1.786 0.07417 .
## calc_six_min_walk -0.0001543 0.0001424 -1.084 0.27857
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1802.8 on 1378 degrees of freedom
## Residual deviance: 1715.3 on 1367 degrees of freedom
## AIC: 1739.3
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2009 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2009,
family="binomial")
summary(logitmod_mv_2009)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2009)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8858 -1.2199 0.8042 1.0386 1.6950
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0611328 0.2382180 -0.257 0.797468
## age_cent 0.0275556 0.0041247 6.681 2.38e-11 ***
## ci_bin1 -0.4744927 0.5140554 -0.923 0.355987
## calc_vent_use_bin1 0.3234818 0.2427034 1.333 0.182588
## end_creat 0.3381901 0.2153967 1.570 0.116397
## creatinc2_flag1 -0.7136328 0.5979187 -1.194 0.232663
## dxPulm Htx -0.4589028 0.2939164 -1.561 0.118444
## dxCF 0.6480209 0.1832722 3.536 0.000406 ***
## dxPulm fibrosis 0.1603112 0.1172918 1.367 0.171697
## func_stat_trr_bin1 -0.1560963 0.1103401 -1.415 0.157161
## calc_o2 -0.0001065 0.0110668 -0.010 0.992321
## calc_six_min_walk -0.0002577 0.0001260 -2.046 0.040790 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2096.1 on 1539 degrees of freedom
## Residual deviance: 2017.8 on 1528 degrees of freedom
## AIC: 2041.8
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2010 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2010,
family="binomial")
summary(logitmod_mv_2010)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2010)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8572 -1.1904 0.8323 1.0735 1.6168
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3282533 0.2342446 -1.401 0.161116
## age_cent 0.0214326 0.0037865 5.660 1.51e-08 ***
## ci_bin1 -0.4005199 0.4464266 -0.897 0.369629
## calc_vent_use_bin1 0.6617622 0.2045110 3.236 0.001213 **
## end_creat 0.6333254 0.1990055 3.182 0.001460 **
## creatinc2_flag1 -0.8747853 1.2374559 -0.707 0.479615
## dxPulm Htx 0.4957567 0.2548049 1.946 0.051699 .
## dxCF 0.0024180 0.1774981 0.014 0.989131
## dxPulm fibrosis 0.2049398 0.1146956 1.787 0.073968 .
## func_stat_trr_bin1 -0.2259172 0.1068183 -2.115 0.034433 *
## calc_o2 0.0144712 0.0094815 1.526 0.126947
## calc_six_min_walk -0.0004443 0.0001225 -3.626 0.000288 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2305.6 on 1672 degrees of freedom
## Residual deviance: 2226.0 on 1661 degrees of freedom
## AIC: 2250
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2011 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2011,
family="binomial")
summary(logitmod_mv_2011)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2011)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5913 -1.0961 -0.8973 1.2076 1.6825
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6953401 0.2279787 -3.050 0.00229 **
## age_cent 0.0181942 0.0037221 4.888 1.02e-06 ***
## ci_bin1 0.2885919 0.4078693 0.708 0.47922
## calc_vent_use_bin1 0.5223740 0.2007506 2.602 0.00927 **
## end_creat 0.5120500 0.2029923 2.523 0.01165 *
## creatinc2_flag1 0.4217002 0.7886095 0.535 0.59283
## dxPulm Htx 0.3059532 0.2736612 1.118 0.26357
## dxCF 0.2560790 0.1676257 1.528 0.12659
## dxPulm fibrosis 0.2418006 0.1115166 2.168 0.03014 *
## func_stat_trr_bin1 -0.1303708 0.0989211 -1.318 0.18753
## calc_o2 -0.0104169 0.0090345 -1.153 0.24890
## calc_six_min_walk -0.0001697 0.0001206 -1.407 0.15950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2398.4 on 1737 degrees of freedom
## Residual deviance: 2347.6 on 1726 degrees of freedom
## AIC: 2371.6
##
## Number of Fisher Scoring iterations: 4
# No one experienced a creatinine increase >=150% in 2012, so we cannot include this variable in the model for this year
logitmod_mv_2012 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2012,
family="binomial")
summary(logitmod_mv_2012)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + dx + func_stat_trr_bin + calc_o2 + calc_six_min_walk,
## family = "binomial", data = UNOS2012)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5875 -1.0331 -0.8461 1.2489 1.7816
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7189466 0.2269372 -3.168 0.00153 **
## age_cent 0.0154952 0.0039076 3.965 7.33e-05 ***
## ci_bin1 -0.5728183 0.5536408 -1.035 0.30084
## calc_vent_use_bin1 0.2244810 0.2112165 1.063 0.28787
## end_creat 0.7577101 0.1935696 3.914 9.06e-05 ***
## dxPulm Htx 0.1309153 0.2927522 0.447 0.65474
## dxCF -0.2859309 0.1825984 -1.566 0.11737
## dxPulm fibrosis 0.0754713 0.1163954 0.648 0.51672
## func_stat_trr_bin1 -0.4891960 0.1015619 -4.817 1.46e-06 ***
## calc_o2 -0.0043069 0.0086935 -0.495 0.62030
## calc_six_min_walk -0.0001542 0.0001221 -1.263 0.20649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2277.9 on 1677 degrees of freedom
## Residual deviance: 2215.9 on 1667 degrees of freedom
## AIC: 2237.9
##
## Number of Fisher Scoring iterations: 4
# No one experienced a creatinine increase >=150% in 2013, so we cannot include this variable in the model for this year
logitmod_mv_2013 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2013,
family="binomial")
summary(logitmod_mv_2013)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + dx + func_stat_trr_bin + calc_o2 + calc_six_min_walk,
## family = "binomial", data = UNOS2013)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5462 -0.9519 -0.8323 1.3261 1.8128
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7122719 0.2254624 -3.159 0.00158 **
## age_cent 0.0120501 0.0040202 2.997 0.00272 **
## ci_bin1 -0.2535361 0.4375442 -0.579 0.56228
## calc_vent_use_bin1 0.6099471 0.1895739 3.217 0.00129 **
## end_creat 0.4509160 0.1964810 2.295 0.02174 *
## dxPulm Htx -0.1701137 0.3053802 -0.557 0.57749
## dxCF -0.0762582 0.1778637 -0.429 0.66811
## dxPulm fibrosis 0.0778182 0.1154301 0.674 0.50021
## func_stat_trr_bin1 -0.1047534 0.1001335 -1.046 0.29550
## calc_o2 0.0014805 0.0078672 0.188 0.85073
## calc_six_min_walk -0.0005033 0.0001220 -4.127 3.68e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2356.8 on 1804 degrees of freedom
## Residual deviance: 2311.1 on 1794 degrees of freedom
## AIC: 2333.1
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2014 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2014,
family="binomial")
summary(logitmod_mv_2014)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2014)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7307 -0.8408 -0.7481 1.3888 2.0316
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2007918 0.2357386 -5.094 3.51e-07 ***
## age_cent 0.0115785 0.0039448 2.935 0.003334 **
## ci_bin1 -0.0502143 0.5369761 -0.094 0.925496
## calc_vent_use_bin1 0.6972706 0.1992724 3.499 0.000467 ***
## end_creat 0.5951333 0.1993419 2.985 0.002831 **
## creatinc2_flag1 2.3641292 1.1345418 2.084 0.037181 *
## dxPulm Htx -0.2758884 0.3627199 -0.761 0.446890
## dxCF -0.0497467 0.1944016 -0.256 0.798031
## dxPulm fibrosis -0.0085896 0.1240687 -0.069 0.944804
## func_stat_trr_bin1 -0.0815869 0.1063200 -0.767 0.442861
## calc_o2 -0.0027735 0.0080791 -0.343 0.731378
## calc_six_min_walk -0.0003735 0.0001263 -2.958 0.003101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2209.9 on 1837 degrees of freedom
## Residual deviance: 2164.7 on 1826 degrees of freedom
## AIC: 2188.7
##
## Number of Fisher Scoring iterations: 4
logitmod_mv_2015 <- glm(pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx +
func_stat_trr_bin + calc_o2 +
calc_six_min_walk, data=UNOS2015,
family="binomial")
summary(logitmod_mv_2015)##
## Call:
## glm(formula = pstatus ~ age_cent + ci_bin + calc_vent_use_bin +
## end_creat + creatinc2_flag + dx + func_stat_trr_bin + calc_o2 +
## calc_six_min_walk, family = "binomial", data = UNOS2015)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2788 -0.7836 -0.6912 -0.5124 2.0161
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.1951706 0.2505788 -4.770 1.85e-06 ***
## age_cent 0.0036386 0.0042317 0.860 0.389867
## ci_bin1 0.2118276 0.4243128 0.499 0.617621
## calc_vent_use_bin1 0.9854254 0.3503437 2.813 0.004912 **
## end_creat 0.3929320 0.2179505 1.803 0.071412 .
## creatinc2_flag1 0.4204009 0.8741959 0.481 0.630587
## dxPulm Htx 0.3597906 0.3076719 1.169 0.242244
## dxCF -0.0166907 0.2036387 -0.082 0.934677
## dxPulm fibrosis 0.3517214 0.1256772 2.799 0.005132 **
## func_stat_trr_bin1 -0.0617136 0.1073274 -0.575 0.565289
## calc_o2 -0.0185688 0.0084126 -2.207 0.027297 *
## calc_six_min_walk -0.0004655 0.0001333 -3.493 0.000478 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2187.0 on 1948 degrees of freedom
## Residual deviance: 2146.9 on 1937 degrees of freedom
## AIC: 2170.9
##
## Number of Fisher Scoring iterations: 4
# Create table of coefficients if significant interactions with time
# Age
mv_tab_age <- rbind(coef(summary(logitmod_mv_2007))[2,1:2],
coef(summary(logitmod_mv_2008))[2,1:2],
coef(summary(logitmod_mv_2009))[2,1:2],
coef(summary(logitmod_mv_2010))[2,1:2],
coef(summary(logitmod_mv_2011))[2,1:2],
coef(summary(logitmod_mv_2012))[2,1:2],
coef(summary(logitmod_mv_2013))[2,1:2],
coef(summary(logitmod_mv_2014))[2,1:2],
coef(summary(logitmod_mv_2015))[2,1:2])
mv_tab_age <- cbind(years, mv_tab_age)
mv_tab_age <- as.data.frame(mv_tab_age)
names(mv_tab_age) <- c("Year", "Estimate", "Std. Error")
# Mechanical Ventilation
mv_tab_vent <- rbind(coef(summary(logitmod_mv_2007))[4,1:2],
coef(summary(logitmod_mv_2008))[4,1:2],
coef(summary(logitmod_mv_2009))[4,1:2],
coef(summary(logitmod_mv_2010))[4,1:2],
coef(summary(logitmod_mv_2011))[4,1:2],
coef(summary(logitmod_mv_2012))[4,1:2],
coef(summary(logitmod_mv_2013))[4,1:2],
coef(summary(logitmod_mv_2014))[4,1:2],
coef(summary(logitmod_mv_2015))[4,1:2])
mv_tab_vent <- cbind(years[1:9, ], mv_tab_vent)
mv_tab_vent <- as.data.frame(mv_tab_vent)
names(mv_tab_vent) <- c("Year", "Estimate", "Std. Error")
# dxCF
mv_tab_dxCF <- rbind(coef(summary(logitmod_mv_2007))[8,1:2],
coef(summary(logitmod_mv_2008))[8,1:2],
coef(summary(logitmod_mv_2009))[8,1:2],
coef(summary(logitmod_mv_2010))[8,1:2],
coef(summary(logitmod_mv_2011))[8,1:2],
coef(summary(logitmod_mv_2012))[7,1:2], # CF="7" because creatinc2_flag not in model
coef(summary(logitmod_mv_2013))[7,1:2], # CF="7" because creatinc2_flag not in model
coef(summary(logitmod_mv_2014))[8,1:2],
coef(summary(logitmod_mv_2015))[8,1:2])
mv_tab_dxCF <- cbind(years, mv_tab_dxCF)
mv_tab_dxCF <- as.data.frame(mv_tab_dxCF)
names(mv_tab_dxCF) <- c("Year", "Estimate", "Std. Error")
# Plot these coefficients over time
# Age
mv1 <- ggplot(data=mv_tab_age, aes(x=Year, y=Estimate)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`,
ymax=Estimate+`Std. Error`), width=.2) +
geom_point()
# Format the axes
mv1 + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) +
ggtitle("Multivariable Logistic Model: Age Coefficient by Year, \n after adjusting for other LAS variables")# Exponentiate estimates to get AGE coefficient on Odds Ratio scale
mv_tab_age$expEst <- exp(mv_tab_age$Estimate)
mv_tab_age$expUpB <- exp(mv_tab_age$Estimate + mv_tab_age$`Std. Error`)
mv_tab_age$expLowB <- exp(mv_tab_age$Estimate - mv_tab_age$`Std. Error`)
# Plot the AGE coefficient over time (Odds Ratio Scale)
mv1b <- ggplot(data=mv_tab_age, aes(x=Year, y=expEst)) +
geom_errorbar(aes(ymin=expLowB, ymax=expUpB), width=.2) +
geom_point()
# Format the axes
mv1b + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) + ylab("Odds Ratio") +
ggtitle("Multivariable Logistic Model: Age Coefficient by Year \n(Odds Ratio Scale)")# Mechanical Ventilation
mv2 <- ggplot(data=mv_tab_vent, aes(x=Year, y=Estimate)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`,
ymax=Estimate+`Std. Error`), width=.2) +
geom_point()
# Format the axes
mv2 + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) +
ggtitle("Multivariable Logistic Model: Mechanical Ventilation Coefficient by Year, \n after adjusting for other LAS variables")# Exponentiate estimates to get MECHANICAL VENTILATION coefficient on OR scale
mv_tab_vent$expEst <- exp(mv_tab_vent$Estimate)
mv_tab_vent$expUpB <- exp(mv_tab_vent$Estimate +
mv_tab_vent$`Std. Error`)
mv_tab_vent$expLowB <- exp(mv_tab_vent$Estimate -
mv_tab_vent$`Std. Error`)
# Plot the MECHANICAL VENTILATION coefficient over time (Odds Ratio Scale)
mv2b <- ggplot(data=mv_tab_vent, aes(x=Year, y=expEst)) +
geom_errorbar(aes(ymin=expLowB, ymax=expUpB), width=.2) +
geom_point()
# Format the axes
mv2b + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) + ylab("Odds Ratio") +
ggtitle("Multivariable Logistic Model: Mechanical Ventilation Coefficient by Year \n(Odds Ratio Scale)")# Cystic Fibrosis
mv3 <- ggplot(data=mv_tab_dxCF, aes(x=Year, y=Estimate)) +
geom_errorbar(aes(ymin=Estimate-`Std. Error`,
ymax=Estimate+`Std. Error`), width=.2) +
geom_point()
# Format the axes
mv3 + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011,
2012, 2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) +
ggtitle("Multivariable Logistic Model: Cystic Fibrosis Coefficient by Year \n after adjusting for other LAS variables")# Exponentiate estimates to get CF coefficient on OR scale
mv_tab_dxCF$expEst <- exp(mv_tab_dxCF$Estimate)
mv_tab_dxCF$expUpB <- exp(mv_tab_dxCF$Estimate +
mv_tab_dxCF$`Std. Error`)
mv_tab_dxCF$expLowB <- exp(mv_tab_dxCF$Estimate -
mv_tab_dxCF$`Std. Error`)
# Plot the CF coefficient over time (Odds Ratio Scale)
mv3b <- ggplot(data=mv_tab_dxCF, aes(x=Year, y=expEst)) +
geom_errorbar(aes(ymin=expLowB, ymax=expUpB), width=.2) +
geom_point()
# Format the axes
mv3b + scale_x_continuous(breaks=c(2007, 2008, 2009, 2010, 2011, 2012,
2013, 2014, 2015),
labels=c("2007", "2008", "2009", "2010",
"2011", "2012", "2013", "2014",
"2015")) + ylab("Odds Ratio") +
ggtitle("Multivariable Logistic Model: Cystic Fibrosis \nCoefficient by Year (Odds Ratio Scale)")The above exploratory analyses suggest that the one year post-transplant mortality rate decreases over time, the prevalence of risk factors varies over time, and several coefficients of the LAS change significantly over time as well. More specifically, the age coefficient appears to decrease consistently over time, while the coefficients for mechanical ventilation and cystic fibrosis appear to oscillate over time. In light of these findings, we proceed to evaluate the calibration and discrimination of the current LAS, and compare these to the calibration and discrimination of a multivariable logistic regression model which includes the same main effects, along with predictor-by-time interaction terms for the age, mechanical ventialtion, and diagnosis group covariates.
Here, we display the results of the calibration and discrimination assessments of the current post-transplant LAS model. Note that for transplants occurring between 2005 and 2011, we use the coefficients from the 2005 LAS model, whereas for transplants occurring beetween 2012 and 2015, we use the coefficients from the 2010 LAS model. These coefficients were chosen because in clinical practice, the 2005 LAS was used during the years 2005-2011, while the 2010 LAS was used during the years 2012-2015. The baseline survival probability at one year for these models was taken to be the value reported by UNOS at time t=364 days. Calibration was assessed by obtaining predicted probabilities of death from the fitted LAS model, and then comparing these predicted probabilities to the observed proportion of deaths across deciles of predicted probabilities. Discrimination was assessed by plotting ROC curves, both overall, and separately by year.
# Load required packages
library(pROC)
library(ResourceSelection)
library(ggplot2)
library(dplyr)
library(Hmisc)
library(tidyverse)
library(boot)
# Recode the outcome variable
# (only count death if it occurs within 1 year of transplantation)
UNOSclean4$pstatus2 <- rep(NA, nrow(UNOSclean4))
for (i in 1:nrow(UNOSclean4)){
if (UNOSclean4$ptime[i]/365.25 <= 1){ # if event occurs within 1 year,
UNOSclean4$pstatus2[i] <- as.numeric(UNOSclean4$pstatus[i])-1 # leave status as is
} else { # if event occurs after 1 year,
UNOSclean4$pstatus2[i] <- 0 # replace status with 0
}
}# First set up indicator variables for each diagnosis group: ALREADY COMPLETED IN SIMULATED DATA
UNOSclean4$dxObstructive <- rep(0, nrow(UNOSclean4))
UNOSclean4$dxPulmHtx <- rep(0, nrow(UNOSclean4))
UNOSclean4$dxCF <- rep(0, nrow(UNOSclean4))
UNOSclean4$dxPulmFib <- rep(0, nrow(UNOSclean4))
for (i in 1:nrow(UNOSclean4)){
if (UNOSclean4$diag_group[i]==1){
UNOSclean4$dxObstructive[i] <- 1
} else if (UNOSclean4$diag_group[i]==2){
UNOSclean4$dxPulmHtx[i] <- 1
} else if (UNOSclean4$diag_group[i]==3){
UNOSclean4$dxCF[i] <- 1
} else if (UNOSclean4$diag_group[i]==4){
UNOSclean4$dxPulmFib[i] <- 1
}
}# Compute predicted log hazard ratios of SURVIVAL at each year using LAS coefficients reported by UNOS
pred_logHR <- rep(NA, nrow(UNOSclean4)) # Initialize vector to store predicted log HRs
for (i in 1:nrow(UNOSclean4)){
if (UNOSclean4$dxObstructive[i]==1){ # For diagnosis group A
pred_logHR[i] <- 0.0247*(UNOSclean4$age_cent) +
0.3499*(as.numeric(as.character(UNOSclean4$ci_bin[i]))) +
0.6094*(as.numeric(as.character(UNOSclean4$calc_vent_use_bin[i]))) +
0.0896*(UNOSclean4$end_creat[i]) +
0.7709*(as.numeric(as.character(UNOSclean4$creatinc2_flag[i]))) +
0.6116*(UNOSclean4$dxPulmHtx[i]) +
0.3627*(UNOSclean4$dxCF[i]) +
0.4641*(UNOSclean4$dxPulmFib[i]) +
-0.1900*(as.numeric(as.character(UNOSclean4$func_stat_trr_bin[i]))) +
0.0748*(UNOSclean4$calc_o2[i]) +
0.0005*(UNOSclean4$walk_cent[i])
} else{ # For diagnosis groups B, C, D
pred_logHR[i] <- 0.0247*(UNOSclean4$age_cent) +
0.3499*(as.numeric(as.character(UNOSclean4$ci_bin[i]))) +
0.6094*(as.numeric(as.character(UNOSclean4$calc_vent_use_bin[i]))) +
0.0896*(UNOSclean4$end_creat[i]) +
0.7709*(as.numeric(as.character(UNOSclean4$creatinc2_flag[i]))) +
0.6116*(UNOSclean4$dxPulmHtx[i]) +
0.3627*(UNOSclean4$dxCF[i]) +
0.4641*(UNOSclean4$dxPulmFib[i]) +
-0.1900*(as.numeric(as.character(UNOSclean4$func_stat_trr_bin[i]))) +
0.0164*(UNOSclean4$calc_o2[i]) +
0.0005*(UNOSclean4$walk_cent[i])
}
}
# Convert the predicted log hazard ratios of survival into hazard ratios, and raise the baseline post-transplant
# survival probability at 364 days to this power to get 1-year post-transplant survival probabilities
pred_HR_LAS <- exp(pred_logHR)
Stx0_364_2015 <- 0.941315 # Reported 2015 LAS Baseline post-transplant survival probability
# Apply baseline survival probability to predicted hazard ratios to get predicted survival probability
pred_surv_LAS <- as.data.frame(cbind(UNOSclean4$ytx, UNOSclean4$pstatus2, pred_HR_LAS))
names(pred_surv_LAS) <- c("ytx", "pstatus", "pred_HR_LAS")
# Initialize vector to store probabilities
pred_surv_LAS$pred_prob_LASsurv <- rep(NA, nrow(pred_surv_LAS))
# Predicted 1-year post-transplant survival probability
for (i in 1:nrow(pred_surv_LAS)){
pred_surv_LAS$pred_prob_LASsurv[i] <- Stx0_364_2015^(pred_surv_LAS$pred_HR_LAS[i])
}
# Convert the predicted survival probabilities into predicted death probabilities
pred_prob_LAS <- 1 - pred_surv_LAS$pred_prob_LASsurv
# Create matrix with year, pstatus2, predicted probability of death;
# remove observations with missing probabilities
predmatLAS <- as.data.frame(cbind(UNOSclean4$ytx, UNOSclean4$pstatus2, pred_prob_LAS))
predmatLAS2 <- predmatLAS[which(!is.na(pred_prob_LAS)),]
names(predmatLAS2) <- c("year", "pstatus", "predprobLAS")
# Plot a histogram of predicted probabilities of death by outcome status (Count)
ggplot(predmatLAS2, aes(x=predprobLAS)) +
geom_histogram(bins=30) +
facet_grid(~pstatus) +
xlab("Predicted Probability of Death") +
ylab("Count") +
ggtitle("Histogram of Predicted Probabilities of death (from LAS) by Outcome \n(1=Dead, 0=Alive)")# Plot a histogram of predicted probabilities by outcome status (Percent)
ggplot(predmatLAS2, aes(x=predprobLAS)) +
geom_histogram(aes(y=(..count../sum(..count..))*100), bins=30) +
facet_grid(~pstatus) +
xlab("Predicted Probability of Death (from LAS)") +
ylab("Percent") +
ggtitle("Histogram of Predicted Probabilities of death (from LAS) by Outcome \n(1=Dead, 0=Alive)")# Compute the Brier Score
LAS.BrierScore <- mean((predmatLAS2$predprob-predmatLAS2$pstatus)^2, na.rm=TRUE)
print(LAS.BrierScore)## [1] 0.09663103
# Hosmer-Lemeshow Goodness of Fit Test, using 10 groups (i.e., deciles) of predicted probabilities
hl_gof2 <- hoslem.test(predmatLAS2$pstatus, predmatLAS2$predprobLAS, g=10)
hl_gof2##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: predmatLAS2$pstatus, predmatLAS2$predprobLAS
## X-squared = 419.21, df = 8, p-value < 2.2e-16
# Plot Observed Proportion of Deaths vs. Average Predicted Probability for each decile group from Hosmer-Lemeshow test
# Create decile groups based on predicted probabilities
predmatLAS2$group <- with(predmatLAS2, cut(predprobLAS,
quantile(predprobLAS, probs=seq(0,1, by=0.1), na.rm=TRUE),
include.lowest=TRUE))
# Average the predicted probabilities within each decile group
avepp_tabLAS <- predmatLAS2 %>% dplyr::group_by(group) %>% dplyr::summarise(avepp = mean(predprobLAS))
# Compute the observed proportion of deaths in each decile group
obsp_tabLAS<- predmatLAS2 %>% dplyr::group_by(group) %>% dplyr::summarise(obsp = mean(pstatus))
# Merge the observed and predicted probability tables together
oeprob_tabLAS <- full_join(avepp_tabLAS, obsp_tabLAS, by="group")
print(oeprob_tabLAS)## # A tibble: 10 x 3
## group avepp obsp
## <fct> <dbl> <dbl>
## 1 [0.0468,0.0953] 0.0816 0.0836
## 2 (0.0953,0.11] 0.103 0.0990
## 3 (0.11,0.121] 0.115 0.0923
## 4 (0.121,0.132] 0.126 0.116
## 5 (0.132,0.142] 0.137 0.101
## 6 (0.142,0.153] 0.148 0.102
## 7 (0.153,0.167] 0.160 0.0950
## 8 (0.167,0.185] 0.175 0.114
## 9 (0.185,0.217] 0.199 0.103
## 10 (0.217,0.648] 0.276 0.114
# Create a scatter plot of the observed proportions vs. average predicted probabilities of death
ggplot(data=oeprob_tabLAS, aes(x=avepp, y=obsp)) +
geom_point(color="red") +
geom_abline(slope=1, intercept=0) +
xlim(0,1) +
ylim(0,1) +
xlab("Average Predicted Probabilities of Death") +
ylab("Observed Proportion of Deaths") +
ggtitle("Observed vs. Expected Probabilities of Death \nfrom LAS")When we use the coefficients of the 2015 LAS for all years, the Hosmer-Lemeshow statistic is 95.8 on 8 degrees of freedom, and the p-value is less than 2.2x10^-16. In both cases, the p-value is less than the significance level of \(\alpha\)=0.05, so we reject the null hypothesis, and conclude that the calibration of the LAS model is inadequate. More specifically, the plot of the observed proportion of deaths versus average predicted probabilities of death across deciles of predicted probabilities indicates that for all but the highest decile, the LAS underestimates the probability of death (or, equivalently, the LAS overestimates the probability of survival).
Next, we evaluate the discriminative ability of the LAS model by plotting ROC curve:
# ROC Curve comparing observed outcome vs. LAS Prediction
# Overall
roc(UNOSclean4$pstatus2, pred_prob_LAS, ci=TRUE)##
## Call:
## roc.default(response = UNOSclean4$pstatus2, predictor = pred_prob_LAS, ci = TRUE)
##
## Data: pred_prob_LAS in 13428 controls (UNOSclean4$pstatus2 0) < 1525 cases (UNOSclean4$pstatus2 1).
## Area under the curve: 0.5181
## 95% CI: 0.5029-0.5332 (DeLong)
plot.roc(UNOSclean4$pstatus2, pred_prob_LAS, ci=TRUE, legacy.axes=TRUE, xlim=c(1,0), print.auc=1)#title("Observed Outcome vs. LAS Prediction")When the LAS is used to predict outcomes, the area under the overall ROC curve is estimated to be 0.578 (95% confidence interval: 0.564-0.593). This result implies that the LAS model exhibits poor discrimination.
For comparison, we also assessed the calibration and discrimination of a multivariable logistic regression model which contained the same main effects as the current LAS, but which additionally included interactions between key predictors and time (as identified in Analyses: Part 1). 10-fold cross-validation was performed to avoid overfitting this model.
# Multivariable logistic regression model with predictor-by-year interactions ("True" Model)
logitmod_mv <- glm(pstatus2 ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx + func_stat_trr_bin +
calc_o2 + walk_cent + ytx_cent +
age_cent*ytx_cent +
calc_vent_use_bin*ytx_cent +
dx*ytx_cent,
data=UNOSclean4, family="binomial")
# Obtain predicted probabilities of death from multivariable logistic model
logitmod_predprob <- predict(logitmod_mv, UNOSclean4, type="response")
#UNOSclean4 <- UNOSclean4[, -c(19, 20, 21, 22, 23, 24)] # Remove columns named NA
# Also obtain predicted probabilities using K-fold cross-validation for comparison
N <- nrow(UNOSclean4)
K <- 10 # 10-fold cross-validation
set.seed(1234)
s <- sample(1:K, size=N, replace=T)
pred.outputs.logistic <- vector(mode="numeric", length=N)
obs.outputs <- vector(mode="numeric", length=N)
obs.year <- vector(mode="numeric", length=N)
offset <- 0
for (i in 1:K){
train <- filter(UNOSclean4, s != i)
test <- filter(UNOSclean4, s == i)
obs.outputs[1:length(s[s==i]) + offset] <- test$pstatus2
obs.year[1:length(s[s==i]) + offset] <- test$ytx
# Logistic train/test
logistic <- glm(pstatus2 ~ age_cent + ci_bin + calc_vent_use_bin +
end_creat + creatinc2_flag + dx + func_stat_trr_bin +
calc_o2 + walk_cent + ytx_cent +
age_cent*ytx_cent +
calc_vent_use_bin*ytx_cent +
dx*ytx_cent,
data=train, family=binomial(logit))
logistic.pred.curr <- predict(logistic, test, type="response")
pred.outputs.logistic[1:length(s[s==i]) + offset] <- logistic.pred.curr
offset <- offset + length(s[s==i])
}
# Use cross-validated predicted probabilities in all subsequent analyses
# to mitigate problem of overfitting
# Create matrix containing year, pstatus2, & predicted probability;
# remove observations with missing probabilities
predmat <- as.data.frame(cbind(obs.year, obs.outputs, pred.outputs.logistic))
predmat2 <- predmat[which(!is.na(pred.outputs.logistic)),]
names(predmat2) <- c("year", "pstatus", "predprob")
# Plot a histogram of predicted probabilities by outcome status (Count)
ggplot(predmat2, aes(x=predprob)) +
geom_histogram(bins=30) +
facet_grid(~pstatus) +
xlab("Predicted Probability of Death") +
ylab("Count") +
ggtitle("Histogram of Predicted Probabilities by Outcome \n(1=Dead, 0=Alive)")# Plot a histogram of predicted probabilities by outcome status (Percent)
ggplot(predmat2, aes(x=predprob)) +
geom_histogram(aes(y=(..count../sum(..count..))*100), bins=30) +
facet_grid(~pstatus) +
xlab("Predicted Probability of Death") +
ylab("Percent") +
ggtitle("Histogram of Predicted Probabilities by Outcome \n(1=Dead, 0=Alive)")# Compute the Brier Score
logitmod.BrierScore <- mean((predmat2$predprob-predmat2$pstatus)^2, na.rm=TRUE)
print(logitmod.BrierScore)## [1] 0.09129411
# Hosmer-Lemeshow Goodness of Fit Test, using 10 groups (i.e., deciles) of predicted probabilities
hl_gof1 <- hoslem.test(predmat2$pstatus, predmat2$predprob, g=10)
hl_gof1##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: predmat2$pstatus, predmat2$predprob
## X-squared = 24.435, df = 8, p-value = 0.001937
# Print table of observed vs. expected outcomes
oe_data1 <- as.data.frame(cbind(hl_gof1$observed, hl_gof1$expected))
print(oe_data1)## y0 y1 yhat0 yhat1
## [0.0279,0.0745] 1384 112 1396.379 99.62121
## (0.0745,0.0824] 1383 112 1377.170 117.82984
## (0.0824,0.0884] 1368 127 1367.117 127.88311
## (0.0884,0.0938] 1336 159 1358.837 136.16316
## (0.0938,0.0992] 1364 132 1351.708 144.29163
## (0.0992,0.105] 1319 176 1342.692 152.30794
## (0.105,0.111] 1353 142 1333.952 161.04811
## (0.111,0.119] 1305 190 1323.196 171.80367
## (0.119,0.132] 1307 188 1308.191 186.80862
## (0.132,0.286] 1309 187 1269.077 226.92347
# Plot Observed Proportion of Deaths vs. Average Predicted Probability for each decile group from Hosmer-Lemeshow test
# Create decile groups based on predicted probabilities
predmat2$group <- with(predmat2, cut(predprob,
quantile(predprob, probs=seq(0,1, by=0.1), na.rm=TRUE),
include.lowest=TRUE))
# Average the predicted probabilities within each decile group
avepp_tab <- predmat2 %>% group_by(group) %>% summarise(avepp = mean(predprob))
# Compute the observed proportion of deaths in each decile group
obsp_tab <- predmat2 %>% group_by(group) %>% summarise(obsp = mean(pstatus))
# Merge the observed and predicted probability tables together
oeprob_tab <- full_join(avepp_tab, obsp_tab, by="group")
print(oeprob_tab)## # A tibble: 10 x 3
## group avepp obsp
## <fct> <dbl> <dbl>
## 1 [0.0279,0.0745] 0.0666 0.0749
## 2 (0.0745,0.0824] 0.0788 0.0749
## 3 (0.0824,0.0884] 0.0855 0.0849
## 4 (0.0884,0.0938] 0.0911 0.106
## 5 (0.0938,0.0992] 0.0965 0.0882
## 6 (0.0992,0.105] 0.102 0.118
## 7 (0.105,0.111] 0.108 0.0950
## 8 (0.111,0.119] 0.115 0.127
## 9 (0.119,0.132] 0.125 0.126
## 10 (0.132,0.286] 0.152 0.125
# Create a scatter plot of the observed proportions vs. average predicted probabilities
ggplot(data=oeprob_tab, aes(x=avepp, y=obsp)) +
geom_point(color="red") +
geom_abline(slope=1, intercept=0) +
xlim(0,1) +
ylim(0,1) +
xlab("Average Predicted Probabilities of Death") +
ylab("Observed Proportion of Deaths") +
ggtitle("Observed vs. Expected Probabilities of Death \nfrom Logistic Model")The Hosmer-Lemeshow statistic for the logistic regression model is 16.079 on 8 degrees of freedom, and the corresponding p-value is 0.0413. Since this p-value is less than the significance level of \(\alpha\)=0.05, we reject the null hypothesis, and conclude that the multivariable logistic model provides an inadequate fit to the data. Note that this fit is, however, better than the one provided by the LAS.
Plotting the observed proportion of deaths versus the average predicted probabilities across deciles of predicted probability for all years yields data which lie approximately on a straight line through the origin with slope 1 (i.e, a 45-degree line). This result suggests that overall, the multivariable logistic model is marginally adequate.
Next, we plot the ROC curve for the multivariable logistic model:
# ROC Curve comparing predicted vs. observed outcomes
# Overall
roc(UNOSclean4$pstatus2, logitmod_predprob, ci=TRUE)##
## Call:
## roc.default(response = UNOSclean4$pstatus2, predictor = logitmod_predprob, ci = TRUE)
##
## Data: logitmod_predprob in 13428 controls (UNOSclean4$pstatus2 0) < 1525 cases (UNOSclean4$pstatus2 1).
## Area under the curve: 0.571
## 95% CI: 0.556-0.5859 (DeLong)
plot.roc(UNOSclean4$pstatus2, logitmod_predprob, ci=TRUE,
legacy.axes=TRUE, xlim=c(1,0), print.auc=1,
print.auc.x=1.3, print.auc.y=0.8) # All data
roc(obs.outputs, pred.outputs.logistic, ci=TRUE)##
## Call:
## roc.default(response = obs.outputs, predictor = pred.outputs.logistic, ci = TRUE)
##
## Data: pred.outputs.logistic in 13428 controls (obs.outputs 0) < 1525 cases (obs.outputs 1).
## Area under the curve: 0.5558
## 95% CI: 0.5407-0.5708 (DeLong)
plot.roc(obs.outputs, pred.outputs.logistic, ci=TRUE, legacy.axes=TRUE,
xlim=c(1,0), print.auc=1,
print.auc.x=0.4, print.auc.y=0.5,
add=TRUE, col="red") # cv
#title("Observed Outcome vs. Logistic Prediction")
legend("bottomright", legend=c("Training", "Cross-Validation"), col=c("black", "red"), lwd=2)The area under the “overall” ROC curve after 10-fold cross-validation is estimated to be 0.616 (95% confidence interval: 0.601-0.631). This result implies that the multivariable logistic model has better discrimination than the LAS, although it is still quite poor.
Finally, we overlay the ROC curve from the LAS for ease of comparison:
# ROC Curve comparing observed outcome vs. Logistic Prediction and LAS Prediction
# Overall
plot.roc(predmat2$pstatus, predmat2$predprob, ci=TRUE,
legacy.axes=TRUE, xlim=c(1,0),
print.auc=1, print.auc.x=1.3, print.auc.y=0.8,
col="red")
plot.roc(UNOSclean4$pstatus2, pred_prob_LAS, ci=TRUE,
legacy.axes=TRUE, xlim=c(1,0),
print.auc=1, print.auc.x=0.4, print.auc.y=0.5,
col="blue", add=TRUE)
#title("Observed vs. Predicted Outcomes")
legend("bottomright", legend=c("CV Logistic", "LAS"), col=c("red", "blue"), lwd=2)The analyses shown here suggest that the post-transplant mortality rate decreases over time, the prevalence of risk factors varies over time, and the LAS coefficient values change over time. None of these changes are accounted for by the current LAS. Additionally, when comparing observed versus predicted mortality for both the LAS and our basic prediction model, we found that the former model exhibits worse calibration and worse discrimination than the latter. More specifically, the LAS tended to overestimate the probability of survival.
Taken together, these results are consistent with the notion that the LAS is “broken” [8, 9], and provide support for a dynamic prediction modelling strategy [10]. Such a strategy, however, will likely need to be more nuanced than simply allowing top predictors to vary with time, as our naive prediction model which included age-by-time, mechanical ventilation-by-time, and diagnosis-by-time interaction terms did not perform substantially better than the existing LAS.
Egan TM, Murray S, Bustami RT et al. Development of the New Lung Allocation System in the United States. American Journal of Transplantation 2006; 6 (Part 2): 1212-1227.
Gottlieb J. Lung allocation. J Thorac Dis 2017; 9(8): 2670-2674.
Maxwell BG, Mooney JJ, Lee PHU, et al. Increased Resource Use in Lung Transplant Admissions in the Lung Allocation Score Era. Am J Respir Crit Care Med 2015; 191(3): 302-308.
Russo MJ, Meltzer D, Merlo A, et al. Local Allocation of Lung Donors Results in Transplanting Lungs in Lower Priority Transplant Recipients. Ann Thorac Surg 2013; 95(4): 1231-1234.
Wille KM, Harrington KF, Andrade JA, et al. Disparities in lung transplantation before and after introduction of the lung allocation score. J Heart Lung Transplant 2013; 32(7): 684-692.
Thabut G, Munson J, Haynes K, et al. Geographic Disparities in Access to Lung Transplantation Before and After Implementation of the Lung Allocation Score. Am J Transplant 2012; 12(11): 3085-3093.
United Network for Organ Sharing (2015). A Guide to Calculating the Lung Allocation Score. Available here.
Halpern S. Beyond “The LAS is Broken”: Ways to Improve Lung Allocation. American Journal of Respiratory and Critical Care Medicine 2015; 191(3): 245-246.
Gries CJ, Rue TC, Heagerty PJ, et al. Development of a predictive model for long-term survival after lung transplantation and implications for the lung allocation score. J Heart Lung Transplant 2010; 29(7): 731-738.
Steyerberg EW. Updating for a New Setting. In: Clinical Prediction Models. A Practical Approach to Development, Validation, and Updating. New York: Springer Science+Business Media, LLC. 2010.